Thanks for bringing this up!

Could you share the motivation, or a case where this distinction is
important in the context of transfer across the C data interface? The
"struct == record batch" concept has always made sense to me because in
R, a data.frame can have a column that is also a data.frame and there is
no distinction between the two. It seems like the flags may cause some
ambiguous situations... should C++'s ImportArray() error, for example, if
the schema has the ARROW_FLAG_RECORD_BATCH flag set?

Cheers,

-dewey

On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <zotthewiz...@gmail.com> wrote:
>
> Hey everyone,
>
> With some of the other developments surrounding libraries adopting the
> Arrow C Data interfaces, there's been a consistent question about handling
> tables (record batch) vs columns vs scalars.
>
> Right now, a Record Batch is sent through the C interface as a struct
> column whose children are the individual columns of the batch, and a
> Scalar is sent through as just an array of length 1. Applications have
> to create their own contextual way of indicating whether the Array being
> passed should be interpreted as a single array/column or treated as a
> full table/record batch.
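
For concreteness, here is a minimal sketch of what crosses the interface
today. The column names/types are just an example, and a real producer
would also wire up the release callbacks and the matching ArrowArray; the
point is that a two-column batch and a plain struct column export
identically, so the schema alone can't tell a consumer which one it has:

    #include <stdint.h>
    #include <stdio.h>

    /* ArrowSchema as defined by the Arrow C data interface spec
       (normally copied from the spec or pulled in via arrow/c/abi.h). */
    struct ArrowSchema {
      const char* format;
      const char* name;
      const char* metadata;
      int64_t flags;
      int64_t n_children;
      struct ArrowSchema** children;
      struct ArrowSchema* dictionary;
      void (*release)(struct ArrowSchema*);
      void* private_data;
    };

    int main(void) {
      /* A two-column batch ("id": int64, "name": utf8) as exported today.
         The top level is just a struct ("+s"), indistinguishable from a
         plain struct-typed column. Release callbacks omitted for brevity. */
      struct ArrowSchema id = {.format = "l", .name = "id"};
      struct ArrowSchema name = {.format = "u", .name = "name"};
      struct ArrowSchema* cols[] = {&id, &name};
      struct ArrowSchema batch = {
          .format = "+s", .name = "", .n_children = 2, .children = cols};
      printf("top-level format: %s, children: %d\n", batch.format,
             (int)batch.n_children);
      return 0;
    }
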
>
> Rather than introducing new members or otherwise complicating the structs,
> I wanted to gauge how people felt about introducing new flags for the
> ArrowSchema object.
>
> Right now, we only have 3 defined flags:
>
> ARROW_FLAG_DICTIONARY_ORDERED
> ARROW_FLAG_NULLABLE
> ARROW_FLAG_MAP_KEYS_SORTED
>
> The flags member of the struct is an int64, so we have another 61 bits to
> play with! If no one has any strong objections, I wanted to propose adding
> at least 2 new flags:
>
> ARROW_FLAG_RECORD_BATCH
> ARROW_FLAG_SINGLE_COLUMN
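
For reference, the existing flags are just bit values in the int64 flags
member, and the two proposed flags would presumably claim the next free
bits. The values shown below for the new flags are only hypothetical
placeholders to illustrate the idea:

    /* Existing flags, as defined by the C data interface spec. */
    #define ARROW_FLAG_DICTIONARY_ORDERED 1
    #define ARROW_FLAG_NULLABLE 2
    #define ARROW_FLAG_MAP_KEYS_SORTED 4

    /* Proposed flags -- bit values here are hypothetical placeholders
       that would occupy two of the 61 currently unused bits. */
    #define ARROW_FLAG_RECORD_BATCH 8
    #define ARROW_FLAG_SINGLE_COLUMN 16
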
>
> If neither flag is set, then whether the corresponding data should be
> treated as a table or a single column remains contextual, as it is
> today. If ARROW_FLAG_RECORD_BATCH is set, then the corresponding data
> MUST be a struct array and any consumer should interpret it as a record
> batch (erroring otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then the
> corresponding ArrowArray should be interpreted and used as a single
> array/column regardless of its type.
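
A minimal sketch of the consumer-side decision this would enable, reusing
the ArrowSchema definition from the spec (shown earlier) and the
hypothetical flag values above; default_is_batch stands in for whatever
contextual convention a consumer relies on today:

    #include <stdbool.h>

    /* Returns true if the data must be treated as a record batch, false if
       it must be treated as a single column, and falls back to the caller's
       contextual default when neither proposed flag is set. */
    static bool interpret_as_record_batch(const struct ArrowSchema* schema,
                                          bool default_is_batch) {
      if (schema->flags & ARROW_FLAG_RECORD_BATCH) {
        /* Under the proposal the data MUST be a struct array, so a consumer
           should also check schema->format == "+s" and error if it isn't. */
        return true;
      }
      if (schema->flags & ARROW_FLAG_SINGLE_COLUMN) {
        return false;  /* always a single array/column, whatever its type */
      }
      return default_is_batch;  /* neither flag set: stays contextual */
    }
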
>
> This provides a standardized way for producers of Arrow data to indicate
> to consumers, in the schema, how the data they produce should be used (as
> a table or as a column), rather than forcing everyone to come up with
> their own contextualized way of handling things (extra arguments,
> differently named functions for RecordBatch / Array, etc.).
>
> If there are no objections, I'll take a pass at implementing these flags
> in C++ and Go, put up a PR, and start a Vote thread. I just wanted to see
> what others on the mailing list thought before I go ahead and put effort
> into this.
>
> Thanks everyone! Take care!
>
> --Matt
