I like Antoine's idea of making these flags advisory in nature rather than
required in any way. If we add specific values for them to the spec while
explicitly stating that they are advisory and are free to be ignored by
consumers that don't care about the producer's intent, would that address
most of the issues people have brought up so far? Questions like "should
these systems reject a record batch?" or "should they reject a scalar?"
would then be up to each individual system to decide, based on whether it
wants to honor the producer's intent. Systems that simply don't have a
concept that applies can ignore the flags and import as usual; a rough
sketch of what that could look like on the consumer side is below.
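
The flag names here come from this proposal and the bit values are just
placeholders I picked for illustration (the next bits after
ARROW_FLAG_MAP_KEYS_SORTED); none of this is in the spec today:

    #include <cstdint>
    #include <arrow/c/abi.h>  // struct ArrowSchema + existing ARROW_FLAG_* defines

    // Hypothetical new flag values -- names from this proposal, bit
    // positions chosen purely for illustration.
    constexpr int64_t ARROW_FLAG_RECORD_BATCH  = 8;
    constexpr int64_t ARROW_FLAG_SINGLE_COLUMN = 16;
    constexpr int64_t ARROW_FLAG_SCALAR        = 32;

    // An advisory consumer may consult the producer's intent but is never
    // required to; a system with no record batch concept just ignores it.
    inline bool ProducerIntendsRecordBatch(const ArrowSchema& schema) {
      return (schema.flags & ARROW_FLAG_RECORD_BATCH) != 0;
    }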

There could definitely be benefits to an `ImportDatum` function, as Antoine
put it, for certain data transfer scenarios such as UDFs and the like.
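
As a minimal sketch of the general shape (assuming the proposed
ARROW_FLAG_RECORD_BATCH value from above, and not a concrete design),
something like this in Arrow C++:

    #include <arrow/api.h>
    #include <arrow/datum.h>
    #include <arrow/c/bridge.h>  // arrow::ImportArray / arrow::ImportRecordBatch

    // Hypothetical ImportDatum along the lines Antoine suggested: dispatch
    // on the advisory flags, falling back to today's behavior (plain array
    // import) when no flags are set.
    arrow::Result<arrow::Datum> ImportDatum(struct ArrowArray* array,
                                            struct ArrowSchema* schema) {
      // ARROW_FLAG_RECORD_BATCH is the proposed flag sketched above, not an
      // existing definition in arrow/c/abi.h.
      if ((schema->flags & ARROW_FLAG_RECORD_BATCH) != 0) {
        ARROW_ASSIGN_OR_RAISE(auto batch,
                              arrow::ImportRecordBatch(array, schema));
        return arrow::Datum(std::move(batch));
      }
      // Default: no flags set, so import as a single array, as consumers
      // do today.
      ARROW_ASSIGN_OR_RAISE(auto arr, arrow::ImportArray(array, schema));
      return arrow::Datum(std::move(arr));
    }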

What does everyone think?

--Matt

On Tue, May 14, 2024 at 12:16 PM Antoine Pitrou <anto...@python.org> wrote:

>
> I think these flags should be advisory and consumers should be free to
> ignore them. However, some consumers apparently would benefit from them
> to more faithfully represent the producer's intention.
>
> For example, in Arrow C++, we could perhaps have a ImportDatum function
> whose actual return type would depend on which flags are set (though I'm
> not sure what the default behavior, in the absence of any flags, should
> be).
>
> Regards
>
> Antoine.
>
>
> Le 25/04/2024 à 06:54, Weston Pace a écrit :
> > What should be done if a system doesn't have a record batch concept?  For
> > example, if I remember correctly, Velox works this way and only has a
> "row
> > vector" (struct array) but no equivalent to record batch.  Should these
> > systems reject a record batch or should they just accept it as a struct
> > array?
> >
> > What about ArrowArrayStream?  Must it always return "record batch" or can
> > it return single columns? Should the stream be homogenous or is it valid
> if
> > some arrays are single columns and some are record batches?
> >
> > If a scalar comes across on its own then what is the length of the
> scalar?
> > I think this might be a reason to prefer something like REE for scalars
> > since the length would be encoded along with the scalar.
> >
> >
> > On Wed, Apr 24, 2024 at 6:21 PM Keith Kraus
> <ke...@voltrondata.com.invalid>
> > wrote:
> >
> >>> I believe several array implementations (e.g., numpy, R) are able to
> >> broadcast/recycle a length-1 array. Run-end-encoding is also an option
> that
> >> would make that broadcast explicit without expanding the scalar.
> >>
> >> Some libraries behave this way, i.e. Polars, but others like Pandas and
> >> cuDF only broadcast up dimensions. I.E. scalars can be broadcast across
> >> columns or dataframes, columns can be broadcast across dataframes, but
> >> length 1 columns do not broadcast across columns where trying to add
> say a
> >> length 5 and length 1 column isn't valid but adding a length 5 column
> and a
> >> scalar is. Additionally, it differentiates between operations that are
> >> guaranteed to return a scalar, i.e. something like a reduction of
> `sum()`
> >> versus operations that can return a length 1 column depending on the
> data,
> >> i.e. `unique()`.
> >>
> >>> For UDFs: UDFs are a system-specific interface. Presumably, that
> >> interface can encode whether an Arrow array is meant to represent a
> column
> >> or scalar (or record batch or ...). Again, because Arrow doesn't define
> >> scalars (for now...) or UDFs, the UDF interface needs to layer its own
> >> semantics on top of Arrow.
> >>>
> >>> In other words, I don't think the C Data Interface was meant to be
> >> something where you can expect to _only_ pass the ArrowDeviceArray
> around
> >> and have it encode all the semantics for a particular system, right? The
> >> UDF example is something where the engine would pass an ArrowDeviceArray
> >> plus additional context.
> >>
> >> There's a growing trend in execution engines supporting UDFs of Arrow in
> >> and Arrow out, DuckDB, PySpark, DataFusion, etc. Many of them have
> >> different options of passing in RecordBatches vs Arrays where they
> >> currently rely on the Arrow library containers in order to differentiate
> >> them.
> >>
> >> Additionally, libcudf has some generic functions that currently use
> >> Arrow C++ containers
> >> (https://docs.rapids.ai/api/cudf/stable/libcudf_docs/api_docs/interop_arrow/)
> >> for differentiating between RecordBatches, Arrays, and Scalars which
> >> could be moved to using the C Data Interfaces, Polars has similar
> >> (https://docs.pola.rs/py-polars/html/reference/api/polars.from_arrow.html)
> >> that currently uses PyArrow containers, and you could imagine other
> >> DataFrame libraries having similar.
> >>
> >> Ultimately, there's a desire to be able to move Arrow data between
> >> different libraries, applications, frameworks, etc. and given Arrow
> >> implementations like C++, Rust, and Go have containers for
> RecordBatches,
> >> Arrays, and Scalars respectively, things have been built around and
> >> differentiated around the concepts. Maybe trying to differentiate this
> >> information at runtime isn't the correct path, but I believe there's a
> >> demonstrated desire for being able to differentiate things in a library
> >> agnostic way.
> >>
> >> On Tue, Apr 23, 2024 at 8:37 PM David Li <lidav...@apache.org> wrote:
> >>
> >>> For scalars: Arrow doesn't define scalars. They're an implementation
> >>> concept. (They may be a *useful* one, but if we want to define them
> more
> >>> generally, that's a separate discussion.)
> >>>
> >>> For UDFs: UDFs are a system-specific interface. Presumably, that
> >> interface
> >>> can encode whether an Arrow array is meant to represent a column or
> >> scalar
> >>> (or record batch or ...). Again, because Arrow doesn't define scalars
> >> (for
> >>> now...) or UDFs, the UDF interface needs to layer its own semantics on
> >> top
> >>> of Arrow.
> >>>
> >>> In other words, I don't think the C Data Interface was meant to be
> >>> something where you can expect to _only_ pass the ArrowDeviceArray
> around
> >>> and have it encode all the semantics for a particular system, right?
> The
> >>> UDF example is something where the engine would pass an
> ArrowDeviceArray
> >>> plus additional context.
> >>>
> >>>> since we can't determine which a given ArrowArray is on its own. In
> the
> >>>> libcudf situation, it came up with what happens if you pass a
> >> non-struct
> >>>> column to the from_arrow_device method which returns a cudf::table?
> >>> Should
> >>>> it error, or should it create a table with a single column?
> >>>
> >>> Presumably it should just error? I can see this being ambiguous if
> there
> >>> were an API that dynamically returned either a table or a column based
> on
> >>> the input shape (where before it would be less ambiguous since you'd
> >>> explicitly pass pa.RecordBatch or pa.Array, and now it would be
> ambiguous
> >>> since you only pass ArrowDeviceArray). But it doesn't sound like that's
> >> the
> >>> case?
> >>>
> >>> On Tue, Apr 23, 2024, at 11:15, Weston Pace wrote:
> >>>> I tend to agree with Dewey.  Using run-end-encoding to represent a
> >> scalar
> >>>> is clever and would keep the c data interface more compact.  Also, a
> >>> struct
> >>>> array is a superset of a record batch (assuming the metadata is kept
> in
> >>> the
> >>>> schema).  Consumers should always be able to deserialize into a struct
> >>>> array and then downcast to a record batch if that is what they want to
> >> do
> >>>> (raising an error if there happen to be nulls).
> >>>>
> >>>>> Depending on the function in question, it could be valid to pass a
> >>> struct
> >>>>> column vs a record batch with different results.
> >>>>
> >>>> Are there any concrete examples where this is the case?  The closest
> >>>> example I can think of is something like the `drop_nulls` function,
> >>> which,
> >>>> given a record batch, would choose to drop rows where any column is
> >> null
> >>>> and, given an array, only drops rows where the top-level struct is
> >> null.
> >>>> However, it might be clearer to just give the two functions different
> >>> names
> >>>> anyways.
> >>>>
> >>>> On Mon, Apr 22, 2024 at 1:01 PM Dewey Dunnington
> >>>> <de...@voltrondata.com.invalid> wrote:
> >>>>
> >>>>> Thank you for the background!
> >>>>>
> >>>>> I still wonder if these distinctions are the responsibility of the
> >>>>> ArrowSchema to communicate (although perhaps links to the specific
> >>>>> discussions would help highlight use-cases that I am not
> envisioning).
> >>>>> I think these distinctions are definitely important in the contexts
> >>>>> you mentioned; however, I am not sure that the FFI layer is going to
> >>>>> be helpful.
> >>>>>
> >>>>>> In the libcudf situation, it came up with what happens if you pass a
> >>>>> non-struct
> >>>>>> column to the from_arrow_device method which returns a cudf::table?
> >>>>> Should
> >>>>>> it error, or should it create a table with a single column?
> >>>>>
> >>>>> I suppose that I would have expected two functions (one to create a
> >>>>> table and one to create a column). As a consumer I can't envision a
> >>>>> situation where I would want to import an ArrowDeviceArray but where
> I
> >>>>> would want some piece of run-time information to decide what the
> >>>>> return type of the function would be? (With apologies if I am missing
> >>>>> a piece of the discussion).
> >>>>>
> >>>>>> If A and B have different lengths, this is invalid
> >>>>>
> >>>>> I believe several array implementations (e.g., numpy, R) are able to
> >>>>> broadcast/recycle a length-1 array. Run-end-encoding is also an
> option
> >>>>> that would make that broadcast explicit without expanding the scalar.
> >>>>>
> >>>>>> Depending on the function in question, it could be valid to pass a
> >>>>> struct column vs a record batch with different results.
> >>>>>
> >>>>> If this is an important distinction for an FFI signature of a UDF,
> >>>>> there would probably be a struct definition for the UDF where there
> >>>>> would be an opportunity to make this distinction (and perhaps others
> >>>>> that are relevant) without loading this concept onto the existing
> >>>>> structs.
> >>>>>
> >>>>>> If no flags are set, then the behavior shouldn't change
> >>>>>> from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set,
> >> then
> >>> it
> >>>>>> should error unless calling ImportRecordBatch.
> >>>>>
> >>>>> I am not sure I would have expected that (since a struct array has an
> >>>>> unambiguous interpretation as a record batch and as a user I've very
> >>>>> explicitly decided that I want one, since I'm using that function).
> >>>>>
> >>>>> In the other direction, I am not sure a producer would be able to set
> >>>>> these flags without breaking backwards compatibility with earlier
> >>>>> producers that did not set them (since earlier threads have suggested
> >>>>> that it is good practice to error when an unsupported flag is
> >>>>> encountered).
> >>>>>
> >>>>> On Sun, Apr 21, 2024 at 6:16 PM Matt Topol <zotthewiz...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>> First, I forgot a flag in my examples. There should also be an
> >>>>>> ARROW_FLAG_SCALAR too!
> >>>>>>
> >>>>>> The motivation for this distinction came up from discussions during
> >>>>> adding
> >>>>>> support for ArrowDeviceArray to libcudf in order to better indicate
> >>> the
> >>>>>> difference between a cudf::table and a cudf::column which are
> >> handled
> >>>>> quite
> >>>>>> differently. This also relates to the fact that we currently need
> >>>>> external
> >>>>>> context like the explicit ImportArray() and ImportRecordBatch()
> >>> functions
> >>>>>> since we can't determine which a given ArrowArray is on its own. In
> >>> the
> >>>>>> libcudf situation, it came up with what happens if you pass a
> >>> non-struct
> >>>>>> column to the from_arrow_device method which returns a cudf::table?
> >>>>> Should
> >>>>>> it error, or should it create a table with a single column?
> >>>>>>
> >>>>>> The other motivation for this distinction is with UDFs in an engine
> >>> that
> >>>>>> uses the C data interface. When dealing with queries and engines, it
> >>>>>> becomes important to be able to distinguish between a record batch,
> >> a
> >>>>>> column and a scalar. For example, take the expression A + B:
> >>>>>>
> >>>>>> If A and B have different lengths, this is invalid..... unless one
> >> of
> >>>>> them
> >>>>>> is a Scalar. This is because Scalars are broadcastable, columns are
> >>> not.
> >>>>>>
> >>>>>> Depending on the function in question, it could be valid to pass a
> >>> struct
> >>>>>> column vs a record batch with different results. It also resolves
> >> some
> >>>>>> ambiguity for UDFs and processing. For instance, given a single
> >>>>> ArrowArray
> >>>>>> of length 1, which is a struct: Is that a Struct Column? A Record
> >>> Batch?
> >>>>> or
> >>>>>> is it a scalar? There's no way to know what the producer's intention
> >>> was
> >>>>> or
> >>>>>> the context without these flags or having to side-channel the
> >>> information
> >>>>>> somehow.
> >>>>>>
> >>>>>>> It seems like it may cause some ambiguous
> >>>>>> situations...should C++'s ImportArray() error, for example, if the
> >>>>>> schema has a ARROW_FLAG_RECORD_BATCH flag?
> >>>>>>
> >>>>>> I would argue yes. If no flags are set, then the behavior shouldn't
> >>>>> change
> >>>>>> from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set,
> >> then
> >>> it
> >>>>>> should error unless calling ImportRecordBatch. It allows the
> >> producer
> >>> to
> >>>>>> provide context as to the source and intention of the structure of
> >> the
> >>>>> data.
> >>>>>>
> >>>>>> --Matt
> >>>>>>
> >>>>>> On Fri, Apr 19, 2024 at 8:24 PM Dewey Dunnington
> >>>>>> <de...@voltrondata.com.invalid> wrote:
> >>>>>>
> >>>>>>> Thanks for bringing this up!
> >>>>>>>
> >>>>>>> Could you share the motivation where this distinction is important
> >>> in
> >>>>>>> the context of transfer across the C data interface? The "struct
> >> ==
> >>>>>>> record batch" concept has always made sense to me because in R, a
> >>>>>>> data.frame can have a column that is also a data.frame and there
> >> is
> >>> no
> >>>>>>> distinction between the two. It seems like it may cause some
> >>> ambiguous
> >>>>>>> situations...should C++'s ImportArray() error, for example, if the
> >>>>>>> schema has a ARROW_FLAG_RECORD_BATCH flag?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> -dewey
> >>>>>>>
> >>>>>>> On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <
> >> zotthewiz...@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hey everyone,
> >>>>>>>>
> >>>>>>>> With some of the other developments surrounding libraries
> >> adopting
> >>>>> the
> >>>>>>>> Arrow C Data interfaces, there's been a consistent question
> >> about
> >>>>>>> handling
> >>>>>>>> tables (record batch) vs columns vs scalars.
> >>>>>>>>
> >>>>>>>> Right now, a Record Batch is sent through the C interface as a
> >>> struct
> >>>>>>>> column whose children are the individual columns of the batch
> >> and
> >>> a
> >>>>>>> Scalar
> >>>>>>>> would be sent through as just an array of length 1. Applications
> >>>>> would
> >>>>>>> have
> >>>>>>>> to create their own contextual way of indicating whether the
> >> Array
> >>>>> being
> >>>>>>>> passed should be interpreted as just a single array/column or
> >>> should
> >>>>> be
> >>>>>>>> treated as a full table/record batch.
> >>>>>>>>
> >>>>>>>> Rather than introducing new members or otherwise complicating
> >> the
> >>>>>>> structs,
> >>>>>>>> I wanted to gauge how people felt about introducing new flags
> >> for
> >>> the
> >>>>>>>> ArrowSchema object.
> >>>>>>>>
> >>>>>>>> Right now, we only have 3 defined flags:
> >>>>>>>>
> >>>>>>>> ARROW_FLAG_DICTIONARY_ORDERED
> >>>>>>>> ARROW_FLAG_NULLABLE
> >>>>>>>> ARROW_FLAG_MAP_KEYS_SORTED
> >>>>>>>>
> >>>>>>>> The flags member of the struct is an int64, so we have another
> >> 61
> >>>>> bits to
> >>>>>>>> play with! If no one has any strong objections, I wanted to
> >>> propose
> >>>>>>> adding
> >>>>>>>> at least 2 new flags:
> >>>>>>>>
> >>>>>>>> ARROW_FLAG_RECORD_BATCH
> >>>>>>>> ARROW_FLAG_SINGLE_COLUMN
> >>>>>>>>
> >>>>>>>> If neither flag is set, then it is contextual as to whether it
> >>>>> should be
> >>>>>>>> expected that the corresponding data is a table or a single
> >>> column.
> >>>>> If
> >>>>>>>> ARROW_FLAG_RECORD_BATCH is set, then the corresponding data MUST
> >>> be a
> >>>>>>>> struct array and should be interpreted as a record batch by any
> >>>>> consumers
> >>>>>>>> (erroring otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then
> >> the
> >>>>>>>> corresponding ArrowArray should be interpreted and utilized as a
> >>>>> single
> >>>>>>>> array/column regardless of its type.
> >>>>>>>>
> >>>>>>>> This provides a standardized way for producers of Arrow data to
> >>>>> indicate
> >>>>>>> in
> >>>>>>>> the schema to consumers how the data they produced should be
> >> used
> >>>>> (as a
> >>>>>>>> table or column) rather than forcing everyone to come up with
> >>> their
> >>>>> own
> >>>>>>>> contextualized way of handling things (extra arguments,
> >>> differently
> >>>>> named
> >>>>>>>> functions for RecordBatch / Array, etc.).
> >>>>>>>>
> >>>>>>>> If there's no objections to this, I'll take a pass at
> >> implementing
> >>>>> these
> >>>>>>>> flags in C++ and Go to put up a PR and make a Vote thread. I
> >> just
> >>>>> wanted
> >>>>>>> to
> >>>>>>>> see what others on the mailing list thought before I go ahead
> >> and
> >>> put
> >>>>>>>> effort into this.
> >>>>>>>>
> >>>>>>>> Thanks everyone! Take care!
> >>>>>>>>
> >>>>>>>> --Matt
> >>>>>>>
> >>>>>
> >>>
> >>
> >
>
