Thanks for bringing this up! Could you share the motivation, i.e., where this distinction is important in the context of transfer across the C data interface? The "struct == record batch" concept has always made sense to me because in R, a data.frame can have a column that is also a data.frame, and there is no distinction between the two. It seems like it may cause some ambiguous situations... Should C++'s ImportArray() error, for example, if the schema has an ARROW_FLAG_RECORD_BATCH flag?
Cheers,

-dewey

On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <zotthewiz...@gmail.com> wrote:
>
> Hey everyone,
>
> With some of the other developments surrounding libraries adopting the
> Arrow C Data interfaces, there's been a consistent question about handling
> tables (record batches) vs columns vs scalars.
>
> Right now, a record batch is sent through the C interface as a struct
> column whose children are the individual columns of the batch, and a scalar
> would be sent through as just an array of length 1. Applications have to
> create their own contextual way of indicating whether the array being
> passed should be interpreted as just a single array/column or should be
> treated as a full table/record batch.
>
> Rather than introducing new members or otherwise complicating the structs,
> I wanted to gauge how people felt about introducing new flags for the
> ArrowSchema object.
>
> Right now, we only have 3 defined flags:
>
> ARROW_FLAG_DICTIONARY_ORDERED
> ARROW_FLAG_NULLABLE
> ARROW_FLAG_MAP_KEYS_SORTED
>
> The flags member of the struct is an int64, so we have another 61 bits to
> play with! If no one has any strong objections, I wanted to propose adding
> at least 2 new flags:
>
> ARROW_FLAG_RECORD_BATCH
> ARROW_FLAG_SINGLE_COLUMN
>
> If neither flag is set, then it is contextual whether the corresponding
> data should be expected to be a table or a single column. If
> ARROW_FLAG_RECORD_BATCH is set, then the corresponding data MUST be a
> struct array and should be interpreted as a record batch by any consumers
> (erroring otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then the
> corresponding ArrowArray should be interpreted and utilized as a single
> array/column regardless of its type.
>
> This provides a standardized way for producers of Arrow data to indicate
> in the schema how the data they produce should be consumed (as a table or
> a column), rather than forcing everyone to come up with their own
> contextualized way of handling things (extra arguments, differently named
> functions for RecordBatch / Array, etc.).
>
> If there are no objections to this, I'll take a pass at implementing these
> flags in C++ and Go, put up a PR, and start a vote thread. I just wanted
> to see what others on the mailing list thought before I go ahead and put
> effort into this.
>
> Thanks everyone! Take care!
>
> --Matt