Brian: yes, you're correct. Sorry, I've been playing around with a couple different ways to extend things, and was conflating them when I wrote my response. For this proposal, the dimension must have the same size for all items in a given record batch.
As suggested by Francois and Wes, I will look into using the ExtensionType to implement this proposal. -Edward On Wed, Jul 31, 2019 at 1:00 PM Brian Hulette <[email protected]> wrote: > I'm a little confused about the proposal now. If the unknown dimension > doesn't have to be the same within a record batch, how would you be able to > deduce it with the approach you described (dividing the logical length of > the values array by the length of the record batch)? > > On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney <[email protected]> wrote: > > > I agree this sounds like a good application for ExtensionType. At > > minimum, ExtensionType can be used to develop a working version of > > what you need to help guide further discussions. > > > > On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques > > <[email protected]> wrote: > > > > > > Hello, > > > > > > if each record has a different size, then I suggest to just use a > > > Struct<Dim, List<T>> where Dim is a struct (or expand in the outer > > > struct). You can probably add your own logic with the recently > > > introduced ExtensionType [1]. > > > > > > François > > > [1] > > > https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h > > > > > > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper > <[email protected]> > > wrote: > > > > > > > > The intention is that each individual record could have a different > > size. > > > > This could be consistent within a given batch, but wouldn't need to > be. > > > > For example, if I wanted to send a 3-channel image, but the image > size > > may > > > > vary for each record, then I could use > > > > FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1]. > > > > > > > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <[email protected]> > > wrote: > > > > > > > > > This isn't really relevant but I feel compelled to point it out - > the > > > > > FixedSizeList type has actually been in the Arrow spec for a while, > > but it > > > > > was only implemented in JS and Java initially. It was implemented > in > > C++ > > > > > just a few months ago. > > > > > > > > > > > > > Thanks for the clarification -- I was going based on the blame > history > > for > > > > Layout.rst, but I guess it just didn't get officially documented > there > > > > until the c++ implementation was added. > > > > > > > > -Edward > > > > > > > > > > > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper > > <[email protected]> > > > > > wrote: > > > > > > > > > > > The FixedSizeList type, which was added to Arrow a few months > ago, > > is an > > > > > > array where each slot contains a fixed-size sequence of values. > > It is > > > > > > specified as FixedSizeList<T>[N], where T is a child type and N > is > > a > > > > > signed > > > > > > int32 that specifies the length of each list. > > > > > > > > > > > > This is useful for encoding fixed-size tensors. E.g., if I have > a > > > > > 100x8x10 > > > > > > tensor, then I can encode it as > > > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100]. > > > > > > > > > > > > But I'm also interested in encoding tensors where some dimension > > sizes > > > > > are > > > > > > not known in advance. It seems to me that FixedSizeList could be > > > > > extended > > > > > > to support this fairly easily, by simply defining that N=-1 means > > "each > > > > > > array slot has the same length, but that length is not known in > > advance." > > > > > > So e.g. we could encode a 100x?x10 tensor as > > > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100]. > > > > > > > > > > > > Since these N=-1 row-lengths are not encoded in the type, we need > > some > > > > > way > > > > > > to determine what they are. Luckily, every Field in the schema > > has a > > > > > > corresponding FieldNode in the message; and those FieldNodes can > > be used > > > > > to > > > > > > deduce the row lengths. In particular, the row length must be > > equal to > > > > > the > > > > > > length of the child node divided by the length of the > > FixedSizeList. > > > > > E.g., > > > > > > if we have a FixedSizeList<byte>[-1] array with the values [[1, > > 2], [3, > > > > > 4], > > > > > > [5, 6]] then the message representation is: > > > > > > > > > > > > * Length: 3, Null count: 0 > > > > > > * Null bitmap buffer: Not required > > > > > > * Values array (byte array): > > > > > > * Length: 6, Null count: 0 > > > > > > * Null bitmap buffer: Not required > > > > > > * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding > bytes>] > > > > > > > > > > > > So we can deduce that the row length is 6/3=2. > > > > > > > > > > > > It looks to me like it would be fairly easy to add support for > > this. > > > > > E.g., > > > > > > in the FixedSizeListArray constructor in c++, if > > list_type()->list_size() > > > > > > is -1, then set list_size_ to values.length()/length. There > would > > be no > > > > > > changes to the schema.fbs/message.fbs files -- we would just be > > > > > assigning a > > > > > > meaning to something that's currently meaningless (having > > > > > > FixedSizeList.listSize=-1). > > > > > > > > > > > > If there's support for adding this to Arrow, then I could put > > together a > > > > > > PR. > > > > > > > > > > > > Thanks, > > > > > > -Edward > > > > > > > > > > > > P.S. Apologies if this gets posted twice -- I sent it out a > couple > > days > > > > > ago > > > > > > right before subscribing to the mailing list; but I don't see it > > on the > > > > > > archives, presumably because I wasn't subscribed yet when I sent > > it out. > > > > > > > > > > > > > >
