Hello Arrow devs, Just a quick note. To answer one of my earlier questions:
1. Is this array type currently only used in Velox? (not DuckDB like some > of the other new types?) What evidence do we have that it will become used > outside of Velox? > This type is also used by DuckDB. Found discussion today in a talk from Mark Raasveldt [1]. That does improve the case for adding this type in my eyes. Best, Will Jones [1] https://youtu.be/bZOvAKGkzpQ?t=1570 On Tue, Jun 6, 2023 at 7:40 PM Weston Pace <weston.p...@gmail.com> wrote: > > This implies that each canonical alternative layout would codify a > > primary layout as its "fallback." > > Yes, that was part of my proposal: > > > * A new layout, if it is semantically equivalent to another, is > considered an alternative layout > > Or, to phrase it another way. If there is not a "fallback" then it is not > an alternative layout. It's a brand new primary layout. I'd expect this > to be quite rare. I can't really even hypothesize any examples. I think > the only truly atomic layouts are fixed-width, list, and struct. > > > This seems reasonable but it opens > > up some cans of worms, such as how two components communicating > > through an Arrow interface would negotiate which layout is supported > > Most APIs that I'm aware of already do this. For example, > pyarrow.parquet.read_table has a "read_dictionary" property that can be > used to control whether or not a column is returned with the dictionary > encoding. There is no way (that I'm aware of) to get a column in REE > encoding today without explicitly requesting it. In fact, this could be as > simple as a boolean "use_advanced_features" flag although I would > discourage something so simplistic. The point is that arrow-compatible > software should, by default, emit types that are supported by all arrow > implementations. > > Of course, there is no way to enforce this, it's just a guideline / strong > recommendation on how software should behave if it wants to state "arrow > compatible" as a feature. > > On Tue, Jun 6, 2023 at 3:33 PM Ian Cook <ianmc...@apache.org> wrote: > > > Thanks Weston. That all sounds reasonable to me. > > > > > with the caveat that the primary layout must be emitted if the user > > does not specifically request the alternative layout > > > > This implies that each canonical alternative layout would codify a > > primary layout as its "fallback." This seems reasonable but it opens > > up some cans of worms, such as how two components communicating > > through an Arrow interface would negotiate which layout is supported. > > I suppose such details should be discussed in a separate thread, but I > > raise this here just to point out that it implies an expansion in the > > scope of what Arrow interfaces can do. > > > > On Tue, Jun 6, 2023 at 6:17 PM Weston Pace <weston.p...@gmail.com> > wrote: > > > > > > From Micah: > > > > > > > This sounds reasonable to me but my main concern is, I'm not sure > > there is > > > > a great mechanism to enforce canonical layouts don't somehow become > > > default > > > > (or the only implementation). > > > > > > I'm not sure I understand. Is the concern that an alternative layout > is > > > eventually > > > used more and more by implementations until it is used more often than > > the > > > primary > > > layouts? In that case I think that is ok and we can promote the > > alternative > > > to a primary layout. > > > > > > Or is the concern that some applications will only support the > > alternative > > > layouts and > > > not the primary layout? In that case I would argue the application is > > not > > > "arrow compatible". > > > I don't know that we prevent or enforce this today either. An author > can > > > always falsely > > > claim they support Arrow even if they are using their own bespoke > format. > > > > > > From Ian: > > > > > > > It seems to me that most projects that are implementing Arrow today > > > > are not aiming to provide complete coverage of Arrow; rather they are > > > > adopting Arrow because of its role as a standard and they are > > > > implementing only as much of the Arrow standard as they require to > > > > achieve some goal. I believe that such projects are important Arrow > > > > stakeholders, and I believe that this proposed notion of canonical > > > > alternative layouts will serve them well and will create efficiencies > > > > by standardizing implementations around a shared set of alternatives. > > > > > > > > However I think that the documentation for canonical alternative > > > > layouts should strongly encourage implementers to default to using > the > > > > primary layouts defined in the core spec and only use alternative > > > > layouts in cases where the primary layouts do not meet their needs. > > > > > > I'd maybe take a slightly harsher stance. I don't think an application > > > needs to > > > support all types. For example, an Arrow-native string processing > > library > > > might > > > not worry about the integer types. That would be fine. I think it > would > > > still > > > be fair to call it an "arrow compatible string processing library". > > > > > > However, an application must support primary layouts in addition to > > > alternative > > > layouts. For example, a string processing library that expects all > > strings > > > to be > > > delivered as a single buffer sequence of null-terminated strings would > > not > > > be "an > > > arrow compatible string processing library" unless it also fully > > supported > > > the > > > standard (lengths + data) variable-sized list layout for strings > defined > > at > > > [1]. > > > > > > In other words: > > > > > > * Only receives and emits alternative layouts - not arrow compatible > > > * Only receives and emits primary layouts - arrow compatible > > > * Receives and emits both primary and alternative layouts - arrow > > > compatible† > > > > > > † - with the caveat that the primary layout must be emitted if the user > > > does not > > > specifically request the alternative layout. > > > > > > [1] > > > > > > https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout > > > > > > On Tue, Jun 6, 2023 at 2:45 PM Ian Cook <ianmc...@apache.org> wrote: > > > > > > > To clarify why we cannot simply propose adding ListView as a new > > > > “canonical extension type”: The extension type mechanism in Arrow > > > > depends on the underlying data being organized in an existing Arrow > > > > layout—that way an implementation that does not support the extension > > > > type can still handle the underlying data. But ListView is a wholly > > > > new layout. > > > > > > > > I strongly agree with Weston’s idea that it is a good time for Arrow > > > > to introduce the notion of “canonical alternative layouts.” > > > > > > > > Taken together, I think that canonical extension types and canonical > > > > alternative layouts could serve as an “incubator” for proposed new > > > > representations. For example, if a proposed canonical alternative > > > > layout ends up being broadly adopted, then that will serve as a > signal > > > > that we should consider adding it as a primary layout in the core > > > > spec. > > > > > > > > It seems to me that most projects that are implementing Arrow today > > > > are not aiming to provide complete coverage of Arrow; rather they are > > > > adopting Arrow because of its role as a standard and they are > > > > implementing only as much of the Arrow standard as they require to > > > > achieve some goal. I believe that such projects are important Arrow > > > > stakeholders, and I believe that this proposed notion of canonical > > > > alternative layouts will serve them well and will create efficiencies > > > > by standardizing implementations around a shared set of alternatives. > > > > > > > > However I think that the documentation for canonical alternative > > > > layouts should strongly encourage implementers to default to using > the > > > > primary layouts defined in the core spec and only use alternative > > > > layouts in cases where the primary layouts do not meet their needs. > > > > > > > > > > > > On Sat, May 27, 2023 at 7:44 PM Micah Kornfield < > emkornfi...@gmail.com > > > > > > > wrote: > > > > > > > > > > This sounds reasonable to me but my main concern is, I'm not sure > > there > > > > is > > > > > a great mechanism to enforce canonical layouts don't somehow become > > > > default > > > > > (or the only implementation). > > > > > > > > > > Even for these new layouts, I think it might be worth rethinking > > binding > > > > a > > > > > layout into the schema versus having a different concept of > encoding > > (and > > > > > changing some of the corresponding data structures). > > > > > > > > > > > > > > > On Mon, May 22, 2023 at 10:37 AM Weston Pace < > weston.p...@gmail.com> > > > > wrote: > > > > > > > > > > > Trying to settle on one option is a fruitless endeavor. Each > type > > has > > > > pros > > > > > > and cons. I would also predict that the largest existing usage > of > > > > Arrow is > > > > > > shuttling data from one system to another. The newly proposed > > format > > > > > > doesn't appear to have any significant advantage for that use > case > > (if > > > > > > anything, the existing format is arguably better as it is more > > > > compact). > > > > > > > > > > > > I am very biased towards historical precedent and avoiding > breaking > > > > > > changes. > > > > > > > > > > > > We have "canonical extension types", perhaps it is time for > > "canonical > > > > > > alternative layouts". We could define it as such: > > > > > > > > > > > > * There are one or more primary layouts > > > > > > * Existing layouts are automatically considered primary > layouts, > > > > even if > > > > > > they wouldn't > > > > > > have been primary layouts initially (e.g. large list) > > > > > > * A new layout, if it is semantically equivalent to another, is > > > > considered > > > > > > an alternative layout > > > > > > * An alternative layout still has the same requirements for > > adoption > > > > (two > > > > > > implementations and a vote) > > > > > > * An implementation should not feel pressured to rush and > > implement > > > > the > > > > > > new layout. > > > > > > It would be good if they contribute in the discussion and > > > > consider the > > > > > > layout and vote > > > > > > if they feel it would be an acceptable design. > > > > > > * We can define and vote and approve as many canonical > alternative > > > > layouts > > > > > > as we want: > > > > > > * A canonical alternative layout should, at a minimum, have > some > > > > > > reasonable justification, such as improved performance for > > > > algorithm X > > > > > > * Arrow implementations MUST support the primary layouts > > > > > > * An Arrow implementation MAY support a canonical alternative, > > > > however: > > > > > > * An Arrow implementation MUST first support the primary > layout > > > > > > * An Arrow implementation MUST support conversion to/from the > > > > primary > > > > > > and canonical layout > > > > > > * An Arrow implementation's APIs MUST only provide data in the > > > > > > alternative > > > > > > layout if it is explicitly asked for (e.g. schema inference > > should > > > > > > prefer the primary layout). > > > > > > * We can still vote for new primary layouts (e.g. promoting a > > > > canonical > > > > > > alternative) but, in these > > > > > > votes we don't only consider the value (e.g. performance) of > > the > > > > layout > > > > > > but also the interoperability. > > > > > > In other words, a layout can only become a primary layout if > > there > > > > is > > > > > > significant evidence that most > > > > > > implementations plan to adopt it. > > > > > > > > > > > > This lets us evolve support for new layouts more naturally. We > can > > > > > > generally assume that users will not, initially, be aware of > these > > > > > > alternative layouts. However, everything will just work. They > may > > > > start > > > > > > to see a performance penalty stemming from a lack of support for > > these > > > > > > layouts. If this performance penalty becomes significant then > they > > > > will > > > > > > discover it and become aware of the problem. They can then ask > > > > whatever > > > > > > library they are using to add support for the alternative layout. > > As > > > > > > enough users find a need for it then libraries will add support. > > > > > > Eventually, enough libraries will support it that we can adopt it > > as a > > > > > > primary layout. > > > > > > > > > > > > Also, it allows libraries to adopt alternative layouts more > > > > aggressively if > > > > > > they would like while still hopefully ensuring that we eventually > > all > > > > > > converge on the same implementation of the alternative layout. > > > > > > > > > > > > On Mon, May 22, 2023 at 9:35 AM Will Jones < > > will.jones...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Hello Arrow devs, > > > > > > > > > > > > > > I don't understand why we would start deprecating features in > the > > > > Arrow > > > > > > > > format. Even starting this talk might already be a bad idea > > > > PR-wise. > > > > > > > > > > > > > > > > > > > > > > I agree we don't want to make breaking changes to the Arrow > > format. > > > > But > > > > > > > several maintainers have already stated they have no interest > in > > > > > > > maintaining both list types with full compute functionality > > [1][2], > > > > so I > > > > > > > think it's very likely one list type or the other will be > > > > > > > implicitly preferred in the ecosystem if this data type was > > added. > > > > If > > > > > > > that's the case, I'd prefer that we agreed as a community which > > one > > > > > > should > > > > > > > be preferred. Maybe that's not the best path; it's just one way > > for > > > > us to > > > > > > > balance stability, maintenance burden, and relevance. > > > > > > > > > > > > > > Can someone help distill down the primary rationale and usecase > > for > > > > > > > > adding ArrayView to the Arrow Spec? > > > > > > > > > > > > > > > > > > > > > > Looking back at that old thread, I think one of the main > > motivations > > > > is > > > > > > to > > > > > > > try to prevent query engine implementers from feeling there is > a > > > > tradeoff > > > > > > > between having state-of-the-art performance and being > > Arrow-native. > > > > For > > > > > > > some of the new array types, we had both Velox and DuckDB use > > them, > > > > so it > > > > > > > was reasonable to expect they were innovations that might > > > > proliferate. > > > > > > I'm > > > > > > > not sure if the ArrayView is part of that. From Wes earlier > [3]: > > > > > > > > > > > > > > The idea is that in a world of data and query federation (for > > > > example, > > > > > > > > consider [1] where Arrow is being used as a data federation > > layer > > > > with > > > > > > > many > > > > > > > > query engines), we want to increase the amount of data > > in-flight > > > > and > > > > > > > > in-memory that is in Arrow format. So if query engines are > > having > > > > to > > > > > > > depart > > > > > > > > substantially from the Arrow format to get performance, then > > this > > > > > > > creates a > > > > > > > > potential lose-lose situation: * Depart from Arrow: get > better > > > > > > > performance > > > > > > > > but pay serialization costs to read and write Arrow (the > > > > performance > > > > > > and > > > > > > > > resource utilization benefits outweigh the serialization > > costs). > > > > This > > > > > > > puts > > > > > > > > additional pressure on query engines to build specialized > > > > components > > > > > > for > > > > > > > > solving problems rather than making use of off-the-shelf > > components > > > > > > that > > > > > > > > use Arrow. This has knock-on effects on ecosystem > > fragmentation. * > > > > Or > > > > > > use > > > > > > > > Arrow, and accept suboptimal query processing performance > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Will mentions one usecase is Velox consuming python UDF output, > > which > > > > > > seems > > > > > > > > to be mostly about how fast Velox can consume this format, > not > > how > > > > fast > > > > > > > it > > > > > > > > can be written. Are there other usecases? > > > > > > > > > > > > > > > > > > > > > > To be clear, I don't know if that's the use case they want. > > That's > > > > just > > > > > > me > > > > > > > speculating. > > > > > > > > > > > > > > I still have some questions myself: > > > > > > > > > > > > > > 1. Is this array type currently only used in Velox? (not DuckDB > > like > > > > some > > > > > > > of the other new types?) What evidence do we have that it will > > become > > > > > > used > > > > > > > outside of Velox? > > > > > > > 2. We already have three list types: list, large list (64-bit > > > > offsets), > > > > > > and > > > > > > > fixed size list. Do we think we will only want a view version > of > > the > > > > > > 32-bit > > > > > > > offset variable length list? Or are we potentially talking > about > > view > > > > > > > variants for all three? > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > Will Jones > > > > > > > > > > > > > > [1] > > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > > > > > > > [2] > > https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3 > > > > > > > [3] > > https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t > > > > > > > > > > > > > > On Mon, May 22, 2023 at 3:48 AM Andrew Lamb < > > al...@influxdata.com> > > > > > > wrote: > > > > > > > > > > > > > > > Can someone help distill down the primary rationale and > > usecase for > > > > > > > > adding ArrayView to the Arrow Spec? > > > > > > > > > > > > > > > > From the above discussions, the stated rationale seems to be > > fast > > > > > > > > (zero-copy) interchange with Velox. > > > > > > > > > > > > > > > > This thread has qualitatively enumerated the benefits of > > > > (offset+len) > > > > > > > > encoding over the existing Arrow ListArray (offets) approach, > > but I > > > > > > > haven't > > > > > > > > seen any performance measurements that might help us to gauge > > the > > > > > > > tradeoff > > > > > > > > in additional complexity vs runtime overhead. > > > > > > > > > > > > > > > > Will mentions one usecase is Velox consuming python UDF > output, > > > > which > > > > > > > seems > > > > > > > > to be mostly about how fast Velox can consume this format, > not > > how > > > > fast > > > > > > > it > > > > > > > > can be written. Are there other usecases? > > > > > > > > > > > > > > > > Do we have numbers showing how much overhead converting to > > /from > > > > > > Velox's > > > > > > > > internal representation and the existing ListArray adds? Has > > > > anyone in > > > > > > > > Velox land considered adding faster support for Arrow style > > > > ListArray > > > > > > > > encoding? > > > > > > > > > > > > > > > > > > > > > > > > Andrew > > > > > > > > > > > > > > > > On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou < > > anto...@python.org > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > I don't understand why we would start deprecating features > > in the > > > > > > Arrow > > > > > > > > > format. Even starting this talk might already be a bad idea > > > > PR-wise. > > > > > > > > > > > > > > > > > > As for implementing conversions at the I/O boundary, it's a > > > > > > reasonably > > > > > > > > > policy, but it still requires work by implementors and it's > > not > > > > > > granted > > > > > > > > > that all consumers of the Arrow format will grow such > > conversions > > > > > > > > > if/when we add non-trivial types such as ListView or > > StringView. > > > > > > > > > > > > > > > > > > Regards > > > > > > > > > > > > > > > > > > Antoine. > > > > > > > > > > > > > > > > > > > > > > > > > > > Le 22/05/2023 à 00:39, Will Jones a écrit : > > > > > > > > > > One more thing: Looking back on the previous > discussion[1] > > > > (which > > > > > > > > Weston > > > > > > > > > > pointed out in his earlier message), Jorge suggested that > > the > > > > old > > > > > > > list > > > > > > > > > > types might be deprecated in favor of view variants [2]. > > Others > > > > > > were > > > > > > > > > > worried that it might undermine the perception that the > > Arrow > > > > > > format > > > > > > > is > > > > > > > > > > stable. I think it might be worth thinking about "soft > > > > deprecating" > > > > > > > the > > > > > > > > > old > > > > > > > > > > list type: suggesting new implementations prefer the list > > > > view, but > > > > > > > > > > reassuring that implementations should support the old > > format, > > > > even > > > > > > > if > > > > > > > > > they > > > > > > > > > > just convert to the new format. To be clear, this > wouldn't > > > > mean we > > > > > > > plan > > > > > > > > > to > > > > > > > > > > create breaking changes in the format; but if we ever did > > for > > > > other > > > > > > > > > > reasons, the old list type might go. > > > > > > > > > > > > > > > > > > > > Arrow compute libraries could choose either format for > > compute > > > > > > > support, > > > > > > > > > and > > > > > > > > > > plan to do conversion at the boundaries. Libraries that > use > > > > the new > > > > > > > > type > > > > > > > > > > will have cheap conversion when reading the old type. > > Meanwhile > > > > > > those > > > > > > > > > that > > > > > > > > > > are still on the old type will have some incentive to > move > > > > towards > > > > > > > the > > > > > > > > > new > > > > > > > > > > one, since that conversion will not be as efficient. > > > > > > > > > > > > > > > > > > > > [1] > > > > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > > > > > > > > [2] > > > > > > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > > > > > > > > > > > > > > > > > > > > On Sun, May 21, 2023 at 3:07 PM Will Jones < > > > > > > will.jones...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > >> Hello, > > > > > > > > > >> > > > > > > > > > >> I think Sasha brings up a good point, that the > advantages > > of > > > > this > > > > > > > > format > > > > > > > > > >> seem to be primarily about query processing. Other > > encodings > > > > like > > > > > > > REE > > > > > > > > > and > > > > > > > > > >> dictionary have space-saving advantages that justify > them > > > > simply > > > > > > in > > > > > > > > > terms > > > > > > > > > >> of space efficiency (although they have query processing > > > > > > advantages > > > > > > > as > > > > > > > > > >> well). As discussed, most use cases are already well > > served by > > > > > > > > existing > > > > > > > > > >> list types and dictionary encoding. > > > > > > > > > >> > > > > > > > > > >> I agree that there are cases where transferring this > type > > > > without > > > > > > > > > >> conversion would be ideal. One use case I can think of > is > > if > > > > Velox > > > > > > > > > wants to > > > > > > > > > >> be able to take Arrow-based UDFs (possibly written with > > > > PyArrow, > > > > > > for > > > > > > > > > >> example) that operate on this column type and therefore > > wants > > > > > > > > zero-copy > > > > > > > > > >> exchange over the C Data Interface. > > > > > > > > > >> > > > > > > > > > >> One big question I have: we already have three list > types: > > > > list, > > > > > > > large > > > > > > > > > >> list (64-bit offsets), and fixed size list. Do we think > we > > > > will > > > > > > only > > > > > > > > > want a > > > > > > > > > >> view version of the 32-bit offset variable length list? > Or > > > > are we > > > > > > > > > >> potentially talking about view variants for all three? > > > > > > > > > >> > > > > > > > > > >> Best, > > > > > > > > > >> > > > > > > > > > >> Will Jones > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira > Carvalho < > > > > > > > > > >> felipe...@gmail.com> wrote: > > > > > > > > > >> > > > > > > > > > >>> The benefit of having a memory format that’s friendly > to > > > > > > > > > non-deterministic > > > > > > > > > >>> order writes is unlocked by the transport and > processing > > of > > > > the > > > > > > > data > > > > > > > > > being > > > > > > > > > >>> agnostic to the physical order as much as possible. > > > > > > > > > >>> > > > > > > > > > >>> Requiring a conversion could cancel out that benefit. > > But it > > > > can > > > > > > > be a > > > > > > > > > >>> provisory step for compatibility between systems that > > don’t > > > > > > > > understand > > > > > > > > > the > > > > > > > > > >>> format yet. This is similar to the situation with > > compression > > > > > > > schemes > > > > > > > > > like > > > > > > > > > >>> run-end encoding — the goal is processing the > compressed > > data > > > > > > > > directly > > > > > > > > > >>> without an expansion step whenever possible. > > > > > > > > > >>> > > > > > > > > > >>> This is why having it as part of the open Arrow format > > is so > > > > > > > > important: > > > > > > > > > >>> everyone can agree on a format that’s friendly to > > parallel > > > > and/or > > > > > > > > > >>> vectorized compute kernels without introducing multiple > > > > > > > incompatible > > > > > > > > > >>> formats to the ecosystem and without imposing a > > conversion > > > > step > > > > > > > > between > > > > > > > > > >>> the > > > > > > > > > >>> different systems. > > > > > > > > > >>> > > > > > > > > > >>> — > > > > > > > > > >>> Felipe > > > > > > > > > >>> > > > > > > > > > >>> On Sat, 20 May 2023 at 20:04 Aldrin > > > > <octalene....@pm.me.invalid> > > > > > > > > > wrote: > > > > > > > > > >>> > > > > > > > > > >>>> I don't feel like this representation is necessarily a > > > > detail of > > > > > > > the > > > > > > > > > >>> query > > > > > > > > > >>>> engine, but I am also not sure why this representation > > would > > > > > > have > > > > > > > to > > > > > > > > > be > > > > > > > > > >>>> converted to a non-view format when serializing. Could > > you > > > > > > clarify > > > > > > > > > >>> that? My > > > > > > > > > >>>> impression is that this representation could be used > for > > > > > > > persistence > > > > > > > > > or > > > > > > > > > >>>> data transfer, though it can be more complex to > > guarantee > > > > the > > > > > > > > portion > > > > > > > > > of > > > > > > > > > >>>> the buffer that an index points to is also present in > > > > memory. > > > > > > > > > >>>> > > > > > > > > > >>>> Sent from Proton Mail for iOS > > > > > > > > > >>>> > > > > > > > > > >>>> > > > > > > > > > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky < > > > > > > > > > >>> krassovskysa...@gmail.com > > > > > > > > > >>>> > > > > <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>> > > > > > > > > > wrote: > > > > > > > > > >>>> > > > > > > > > > >>>> Hi everyone, > > > > > > > > > >>>> I understand that there are numerous benefits to this > > > > > > > representation > > > > > > > > > >>>> during query processing, but would it be fair to say > > that > > > > this > > > > > > is > > > > > > > an > > > > > > > > > >>>> implementation detail of the query engine? Query > engines > > > > don’t > > > > > > > > > >>> necessarily > > > > > > > > > >>>> need to conform to the Arrow format internally, only > at > > > > > > > > ingest/egress > > > > > > > > > >>>> points, and performing a conversion from the non-view > to > > > > view > > > > > > > format > > > > > > > > > >>> seems > > > > > > > > > >>>> like it would be very cheap (though I understand not > > > > necessarily > > > > > > > the > > > > > > > > > >>> other > > > > > > > > > >>>> way around, but you’d need to do that anyway if you’re > > > > > > > serializing). > > > > > > > > > >>>> > > > > > > > > > >>>> Sasha Krassovsky > > > > > > > > > >>>> > > > > > > > > > >>>>> 20 мая 2023 г., в 13:00, Will Jones < > > > > will.jones...@gmail.com> > > > > > > > > > >>>> написал(а): > > > > > > > > > >>>>> > > > > > > > > > >>>>> Thanks for sharing these details, Pedro. The > > conditional > > > > > > > branches > > > > > > > > > >>>> argument > > > > > > > > > >>>>> makes a lot of sense to me. > > > > > > > > > >>>>> > > > > > > > > > >>>>> The tensors point brings up some interesting issues. > > For > > > > now, > > > > > > > we've > > > > > > > > > >>>> defined > > > > > > > > > >>>>> our only tensor extension type to be built on a fixed > > size > > > > > > list. > > > > > > > > If a > > > > > > > > > >>> use > > > > > > > > > >>>>> case of this might be manipulating tensors with zero > > copy, > > > > > > > perhaps > > > > > > > > > >>> that > > > > > > > > > >>>>> suggests that we want a fixed size list variant? In > > > > addition, > > > > > > > would > > > > > > > > > we > > > > > > > > > >>>> have > > > > > > > > > >>>>> to define another extension type to be a ListView > > variant? > > > > Or > > > > > > > would > > > > > > > > > we > > > > > > > > > >>>> want > > > > > > > > > >>>>> to think about making extension types somehow valid > > across > > > > > > > various > > > > > > > > > >>>>> encodings of the same "logical type"? > > > > > > > > > >>>>> > > > > > > > > > >>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha > > > > Pedreira > > > > > > > > > >>>>>> <pedro...@meta.com.invalid> wrote: > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> Hi all, > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> This is Pedro from the Velox team at Meta. This is > my > > > > first > > > > > > time > > > > > > > > > >>> here, > > > > > > > > > >>>> so > > > > > > > > > >>>>>> nice to e-meet you! > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> Adding to what Felipe said, the main reason we > created > > > > > > > “ListView” > > > > > > > > > >>>> (though > > > > > > > > > >>>>>> we just call them ArrayVector/MapVector in Velox) is > > that, > > > > > > along > > > > > > > > > with > > > > > > > > > >>>>>> StringViews for strings, they allow us to write any > > > > columnar > > > > > > > > buffer > > > > > > > > > >>>>>> out-or-order, regardless of their types or > encodings. > > > > This is > > > > > > > > > >>> naturally > > > > > > > > > >>>>>> doable for all primitive types (fixed-size), but not > > for > > > > types > > > > > > > > that > > > > > > > > > >>>> don’t > > > > > > > > > >>>>>> have fixed size and are required to be contiguous. > The > > > > > > > StringView > > > > > > > > > and > > > > > > > > > >>>>>> ListView formats allow us to keep this invariant in > > Velox. > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> Being able to write vectors out-of-order is useful > > when > > > > > > > executing > > > > > > > > > >>>>>> conditionals like IF/SWITCH statements, which are > > > > pervasive > > > > > > > among > > > > > > > > > our > > > > > > > > > >>>>>> workloads. To fully vectorize it, one first > evaluates > > the > > > > > > > > > expression, > > > > > > > > > >>>> then > > > > > > > > > >>>>>> generate a bitmap containing which rows take the > THEN > > and > > > > > > which > > > > > > > > take > > > > > > > > > >>> the > > > > > > > > > >>>>>> ELSE branch. Then you populate all rows that match > the > > > > first > > > > > > > > branch > > > > > > > > > >>> by > > > > > > > > > >>>>>> evaluating the THEN expression in a vectorized > > > > (branch-less > > > > > > and > > > > > > > > > cache > > > > > > > > > >>>>>> friendly) way, and subsequently the ELSE branch. If > > you > > > > can’t > > > > > > > > write > > > > > > > > > >>> them > > > > > > > > > >>>>>> out-of-order, you would either have a big branch per > > row > > > > > > > > dispatching > > > > > > > > > >>> to > > > > > > > > > >>>> the > > > > > > > > > >>>>>> right expression (slow), or populate two distinct > > vectors > > > > then > > > > > > > > > >>> merging > > > > > > > > > >>>> them > > > > > > > > > >>>>>> at the end (potentially even slower). How much > faster > > our > > > > > > > approach > > > > > > > > > is > > > > > > > > > >>>>>> highly depends on the buffer sizes and expressions, > > but we > > > > > > found > > > > > > > > it > > > > > > > > > >>> to > > > > > > > > > >>>> be > > > > > > > > > >>>>>> faster enough on average to justify us extending the > > > > > > underlying > > > > > > > > > >>> layout. > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> With that said, this is all within a single thread > of > > > > > > execution. > > > > > > > > > >>>>>> Parallelization is done by feeding each thread its > own > > > > > > > > vector/data. > > > > > > > > > >>> As > > > > > > > > > >>>>>> pointed out in a previous message, this also gives > > you the > > > > > > > > > >>> flexibility > > > > > > > > > >>>> to > > > > > > > > > >>>>>> implement cardinality increasing/reducing > operations, > > but > > > > we > > > > > > > don’t > > > > > > > > > >>> use > > > > > > > > > >>>> it > > > > > > > > > >>>>>> for that purpose. Operations like filtering, > joining, > > > > > > unnesting > > > > > > > > and > > > > > > > > > >>>> similar > > > > > > > > > >>>>>> are done by wrapping the internal vector in a > > dictionary, > > > > as > > > > > > > these > > > > > > > > > >>> need > > > > > > > > > >>>> to > > > > > > > > > >>>>>> work not only on “ListViews” but on any data types > > with > > > > any > > > > > > > > > encoding. > > > > > > > > > >>>> There > > > > > > > > > >>>>>> are more details on Section 4.2.1 in [1] > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> Beyond this, it also gives function/kernel > developers > > more > > > > > > > > > >>> flexibility > > > > > > > > > >>>> to > > > > > > > > > >>>>>> implement operations that manipulate Arrays/Maps. > For > > > > example, > > > > > > > > > >>>> operations > > > > > > > > > >>>>>> that slice these containers can be implemented in a > > > > zero-copy > > > > > > > > manner > > > > > > > > > >>> by > > > > > > > > > >>>>>> just rearranging the lengths/offsets indices, > without > > ever > > > > > > > > touching > > > > > > > > > >>> the > > > > > > > > > >>>>>> larger internal buffers. This is a similar > motivation > > as > > > > for > > > > > > > > > >>> StringView > > > > > > > > > >>>>>> (think of substr(), trim(), and similar). One nice > > last > > > > > > property > > > > > > > > is > > > > > > > > > >>> that > > > > > > > > > >>>>>> this layout allows for overlapping ranges. This is > > > > something > > > > > > > > > >>> discussed > > > > > > > > > >>>> with > > > > > > > > > >>>>>> our ML people to allow deduping feature values in a > > tensor > > > > > > > (which > > > > > > > > is > > > > > > > > > >>>> fairly > > > > > > > > > >>>>>> common), but not something we have leveraged just > yet. > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> [1] - > https://vldb.org/pvldb/vol15/p3372-pedreira.pdf > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> Best, > > > > > > > > > >>>>>> -- > > > > > > > > > >>>>>> Pedro Pedreira > > > > > > > > > >>>>>> ________________________________ > > > > > > > > > >>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com > > > > > > > > > > > >>>>>> Sent: Friday, May 19, 2023 10:01 AM > > > > > > > > > >>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org> > > > > > > > > > >>>>>> Cc: Pedro Eugenio Rocha Pedreira <pedro...@meta.com > > > > > > > > > > > >>>>>> Subject: Re: [DISCUSS][Format] Starting the draft > > > > > > implementation > > > > > > > > of > > > > > > > > > >>> the > > > > > > > > > >>>>>> ArrayView array format > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael > > > > Taylor-Davies > > > > > > > <r. > > > > > > > > > >>>>>> taylordavies@ googlemail. com. invalid> wrote: Hi > > All, > > > > > if > > > > > > we > > > > > > > > > added > > > > > > > > > >>>>>> this, do we think many Arrow and query > engine > > > > > > implementations > > > > > > > > (for > > > > > > > > > >>>>>> example, DataFusion) will be > > > > > > > > > >>>>>> ZjQcmQRYFpfptBannerStart > > > > > > > > > >>>>>> This Message Is From an External Sender > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> ZjQcmQRYFpfptBannerEnd > > > > > > > > > >>>>>> +pedroerp > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies > > > > > > > > > >>>>>> <r.taylordav...@googlemail.com.invalid> wrote: > > > > > > > > > >>>>>> Hi All, > > > > > > > > > >>>>>> > > > > > > > > > >>>>>>> if we added this, do we think many Arrow and query > > > > > > > > > >>>>>>> engine implementations (for example, DataFusion) > > will be > > > > > > eager > > > > > > > to > > > > > > > > > >>> add > > > > > > > > > >>>>>> full > > > > > > > > > >>>>>>> support for the type, including compute kernels? Or > > are > > > > they > > > > > > > > likely > > > > > > > > > >>> to > > > > > > > > > >>>>>> just > > > > > > > > > >>>>>>> convert this type to ListArray at import > boundaries? > > > > > > > > > >>>>>> I can't speak for query engines in general, but at > > least > > > > for > > > > > > > > > arrow-rs > > > > > > > > > >>>>>> and by extension DataFusion, and based on my current > > > > > > > understanding > > > > > > > > > of > > > > > > > > > >>>>>> the use-cases I would be rather hesitant to add > > support > > > > to the > > > > > > > > > >>> kernels > > > > > > > > > >>>>>> for this array type, definitely instead favouring > > > > conversion > > > > > > at > > > > > > > > the > > > > > > > > > >>>>>> edges. We already have issues with the amount of > code > > > > > > generation > > > > > > > > > >>>>>> resulting in binary bloat and long compile times, > and > > I > > > > worry > > > > > > > this > > > > > > > > > >>> would > > > > > > > > > >>>>>> worsen this situation whilst not really providing > > > > compelling > > > > > > > > > >>> advantages > > > > > > > > > >>>>>> for the vast majority of workloads that don't > interact > > > > with > > > > > > > Velox. > > > > > > > > > >>>>>> Whilst I can definitely see that the ListView > > > > representation > > > > > > is > > > > > > > > > >>> probably > > > > > > > > > >>>>>> a better way to represent variable length lists than > > what > > > > > > arrow > > > > > > > > > >>> settled > > > > > > > > > >>>>>> upon, I'm not yet convinced it is sufficiently > better > > to > > > > > > > > incentivise > > > > > > > > > >>>>>> broad ecosystem adoption. > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> Kind Regards, > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> Raphael Taylor-Davies > > > > > > > > > >>>>>> > > > > > > > > > >>>>>>> On 11/05/2023 21:20, Will Jones wrote: > > > > > > > > > >>>>>>> Hi Felipe, > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> Thanks for the additional details. > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>>> Velox kernels benefit from being able to append > > data to > > > > the > > > > > > > > array > > > > > > > > > >>> from > > > > > > > > > >>>>>>>> different threads without care for strict > ordering. > > > > Only the > > > > > > > > > >>> offsets > > > > > > > > > >>>>>> array > > > > > > > > > >>>>>>>> has to be written according to logical order but > > that is > > > > > > > > > >>> potentially a > > > > > > > > > >>>>>> much > > > > > > > > > >>>>>>>> smaller buffer than the values buffer. > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>> It still seems to me like applications are still > > pretty > > > > > > niche, > > > > > > > > as I > > > > > > > > > >>>>>> suspect > > > > > > > > > >>>>>>> in most cases the benefits are outweighed by the > > costs. > > > > The > > > > > > > > benefit > > > > > > > > > >>>> here > > > > > > > > > >>>>>>> seems pretty limited: if you are trying to split > work > > > > between > > > > > > > > > >>> threads, > > > > > > > > > >>>>>>> usually you will have other levels such as array > > chunks > > > > to > > > > > > > > > >>> parallelize. > > > > > > > > > >>>>>> And > > > > > > > > > >>>>>>> if you have an incoming stream of row data, you'll > > want > > > > to > > > > > > > append > > > > > > > > > in > > > > > > > > > >>>>>>> predictable order to match the order of the other > > > > arrays. Am > > > > > > I > > > > > > > > > >>> missing > > > > > > > > > >>>>>>> something? > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> And, IIUC, the cost of using ListView with > > out-of-order > > > > > > values > > > > > > > > over > > > > > > > > > >>>>>>> ListArray is you lose memory locality; the values > of > > > > element > > > > > > 2 > > > > > > > > are > > > > > > > > > >>> no > > > > > > > > > >>>>>>> longer adjacent to the values of element 1. What do > > you > > > > think > > > > > > > > about > > > > > > > > > >>>> that > > > > > > > > > >>>>>>> tradeoff? > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> I don't mean to be difficult about this. I'm > excited > > for > > > > both > > > > > > > the > > > > > > > > > >>> REE > > > > > > > > > >>>> and > > > > > > > > > >>>>>>> StringView arrays, but this one I'm not so sure > about > > > > yet. I > > > > > > > > > suppose > > > > > > > > > >>>>>> what I > > > > > > > > > >>>>>>> am trying to ask is, if we added this, do we think > > many > > > > Arrow > > > > > > > and > > > > > > > > > >>> query > > > > > > > > > >>>>>>> engine implementations (for example, DataFusion) > > will be > > > > > > eager > > > > > > > to > > > > > > > > > >>> add > > > > > > > > > >>>>>> full > > > > > > > > > >>>>>>> support for the type, including compute kernels? Or > > are > > > > they > > > > > > > > likely > > > > > > > > > >>> to > > > > > > > > > >>>>>> just > > > > > > > > > >>>>>>> convert this type to ListArray at import > boundaries? > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> Because if it turns out to be the latter, then we > > might > > > > as > > > > > > well > > > > > > > > ask > > > > > > > > > >>>> Velox > > > > > > > > > >>>>>>> to export this type as ListArray and save the rest > > of the > > > > > > > > ecosystem > > > > > > > > > >>>> some > > > > > > > > > >>>>>>> work. > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> Best, > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> Will Jones > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira > > > > Carvalho < > > > > > > > > > >>>>>>> felipe...@gmail.com<mailto:felipe...@gmail.com>> > > wrote: > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>>> Initial reason for ListView arrays in Arrow is > > zero-copy > > > > > > > > > >>> compatibility > > > > > > > > > >>>>>> with > > > > > > > > > >>>>>>>> Velox which uses this format. > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> Velox kernels benefit from being able to append > > data to > > > > the > > > > > > > > array > > > > > > > > > >>> from > > > > > > > > > >>>>>>>> different threads without care for strict > ordering. > > > > Only the > > > > > > > > > >>> offsets > > > > > > > > > >>>>>> array > > > > > > > > > >>>>>>>> has to be written according to logical order but > > that is > > > > > > > > > >>> potentially a > > > > > > > > > >>>>>> much > > > > > > > > > >>>>>>>> smaller buffer than the values buffer. > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> Acero kernels could take advantage of that in the > > > > future. > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> In implementing ListViewArray/Type I was able to > > reuse > > > > some > > > > > > > C++ > > > > > > > > > >>>>>> templates > > > > > > > > > >>>>>>>> used for ListArray which can reduce some of the > > burden > > > > on > > > > > > > kernel > > > > > > > > > >>>>>>>> implementations that aim to work with all the > types. > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> I’m can fix Acero kernels for working with > ListView. > > > > This is > > > > > > > > > >>> similar > > > > > > > > > >>>> to > > > > > > > > > >>>>>> the > > > > > > > > > >>>>>>>> work I’ve doing in kernels dealing with run-end > > encoded > > > > > > > arrays. > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> — > > > > > > > > > >>>>>>>> Felipe > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> On Wed, 26 Apr 2023 at 01:03 Will Jones < > > > > > > > > will.jones...@gmail.com > > > > > > > > > >>>>>> <mailto:will.jones...@gmail.com>> wrote: > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>>> I suppose one common use case is materializing > list > > > > columns > > > > > > > > after > > > > > > > > > >>>> some > > > > > > > > > >>>>>>>>> expanding operation like a join or unnest. > That's a > > > > case > > > > > > > where > > > > > > > > I > > > > > > > > > >>>> could > > > > > > > > > >>>>>>>>> imagine a lot of repetition of values. Haven't > yet > > > > thought > > > > > > of > > > > > > > > > >>> common > > > > > > > > > >>>>>>>> cases > > > > > > > > > >>>>>>>>> where there is overlap but not full duplication, > > but am > > > > > > eager > > > > > > > > to > > > > > > > > > >>> hear > > > > > > > > > >>>>>>>> any. > > > > > > > > > >>>>>>>>> The dictionary encoding point Raphael makes is > > > > interesting, > > > > > > > > > >>>> especially > > > > > > > > > >>>>>>>>> given the existence of LargeList and > > FixedSizeList. For > > > > > > many > > > > > > > > > >>>>>> operations, > > > > > > > > > >>>>>>>> it > > > > > > > > > >>>>>>>>> might make more sense to just compose those > > existing > > > > types. > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> IIUC the operations that would be unique to the > > > > ArrayView > > > > > > are > > > > > > > > > ones > > > > > > > > > >>>>>>>> altering > > > > > > > > > >>>>>>>>> the shape. One could truncate each array to a > > certain > > > > > > length > > > > > > > > > >>> cheaply > > > > > > > > > >>>>>>>> simply > > > > > > > > > >>>>>>>>> by replacing the sizes buffer. Or perhaps there > are > > > > > > > interesting > > > > > > > > > >>>>>>>> operations > > > > > > > > > >>>>>>>>> on tensors that would benefit. > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> On Tue, Apr 25, 2023 at 7:47 PM Raphael > > Taylor-Davies > > > > > > > > > >>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote: > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>>> Unless I am missing something, I think the > > selection > > > > > > > use-case > > > > > > > > > >>> could > > > > > > > > > >>>> be > > > > > > > > > >>>>>>>>>> equally well served by a dictionary-encoded > > > > > > > > > BinarArray/ListArray, > > > > > > > > > >>>> and > > > > > > > > > >>>>>>>>> would > > > > > > > > > >>>>>>>>>> have the benefit of not requiring any > > modifications > > > > to the > > > > > > > > > >>> existing > > > > > > > > > >>>>>>>>> format > > > > > > > > > >>>>>>>>>> or kernels. > > > > > > > > > >>>>>>>>>> > > > > > > > > > >>>>>>>>>> The major additional flexibility of the proposed > > > > encoding > > > > > > > > would > > > > > > > > > >>> be > > > > > > > > > >>>>>>>>>> permitting disjoint or overlapping ranges, are > > these > > > > > > common > > > > > > > > > >>> enough > > > > > > > > > >>>> in > > > > > > > > > >>>>>>>>>> practice to represent a meaningful bottleneck? > > > > > > > > > >>>>>>>>>> > > > > > > > > > >>>>>>>>>> > > > > > > > > > >>>>>>>>>> On 26 April 2023 01:40:14 BST, David Li < > > > > > > > lidav...@apache.org > > > > > > > > > >>>> <mailto: > > > > > > > > > >>>>>> lidav...@apache.org>> wrote: > > > > > > > > > >>>>>>>>>>> Is there a need for a 64-bit offsets version > the > > > > same way > > > > > > > we > > > > > > > > > >>> have > > > > > > > > > >>>>>> List > > > > > > > > > >>>>>>>>>> and LargeList? > > > > > > > > > >>>>>>>>>>> And just to be clear, the difference with List > is > > > > that > > > > > > the > > > > > > > > > lists > > > > > > > > > >>>>>> don't > > > > > > > > > >>>>>>>>>> have to be stored in their logical order (or in > > other > > > > > > words, > > > > > > > > > >>> offsets > > > > > > > > > >>>>>> do > > > > > > > > > >>>>>>>>> not > > > > > > > > > >>>>>>>>>> have to be nondecreasing and so we also need > > sizes)? > > > > > > > > > >>>>>>>>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace > > wrote: > > > > > > > > > >>>>>>>>>>>> For context, there was some discussion on this > > back > > > > in > > > > > > > [1]. > > > > > > > > At > > > > > > > > > >>>> that > > > > > > > > > >>>>>>>>>> time > > > > > > > > > >>>>>>>>>>>> this was called "sequence view" but I do not > > like > > > > that > > > > > > > name. > > > > > > > > > >>>>>>>> However, > > > > > > > > > >>>>>>>>>>>> array-view array is a little confusing. Given > > this > > > > is > > > > > > > > similar > > > > > > > > > >>> to > > > > > > > > > >>>>>>>> list > > > > > > > > > >>>>>>>>>> can > > > > > > > > > >>>>>>>>>>>> we go with list-view array? > > > > > > > > > >>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>> Thanks for the introduction. I'd be > interested > > to > > > > hear > > > > > > > > about > > > > > > > > > >>> the > > > > > > > > > >>>>>>>>>>>>> applications Velox has found for these > vectors, > > > > and in > > > > > > > what > > > > > > > > > >>>>>>>>> situations > > > > > > > > > >>>>>>>>>>>> they > > > > > > > > > >>>>>>>>>>>>> are useful. This could be contrasted with the > > > > current > > > > > > > > > >>> ListArray > > > > > > > > > >>>>>>>>>>>>> implementations. > > > > > > > > > >>>>>>>>>>>> I believe one significant benefit is that take > > (and > > > > by > > > > > > > > proxy, > > > > > > > > > >>>>>>>> filter) > > > > > > > > > >>>>>>>>>> and > > > > > > > > > >>>>>>>>>>>> sort are O(# of items) with the proposed > format > > and > > > > O(# > > > > > > of > > > > > > > > > >>> bytes) > > > > > > > > > >>>>>>>> with > > > > > > > > > >>>>>>>>>> the > > > > > > > > > >>>>>>>>>>>> current format. Jorge did some profiling to > this > > > > effect > > > > > > in > > > > > > > > > [1]. > > > > > > > > > >>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>> [1] > > > > > > > > > >>>>>>>> > > > > > > > > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq< > > > > > > > > > >>>>>> > > > > > > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > > > > > > > > > > >>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones < > > > > > > > > > >>>> will.jones...@gmail.com > > > > > > > > > >>>>>> <mailto:will.jones...@gmail.com> > > > > > > > > > >>>>>>>>>> wrote: > > > > > > > > > >>>>>>>>>>>>> Hi Felipe, > > > > > > > > > >>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>> Thanks for the introduction. I'd be > interested > > to > > > > hear > > > > > > > > about > > > > > > > > > >>> the > > > > > > > > > >>>>>>>>>>>>> applications Velox has found for these > vectors, > > > > and in > > > > > > > what > > > > > > > > > >>>>>>>>> situations > > > > > > > > > >>>>>>>>>> they > > > > > > > > > >>>>>>>>>>>>> are useful. This could be contrasted with the > > > > current > > > > > > > > > >>> ListArray > > > > > > > > > >>>>>>>>>>>>> implementations. > > > > > > > > > >>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>> IIUC it would be fairly cheap to transform a > > > > ListArray > > > > > > to > > > > > > > > an > > > > > > > > > >>>>>>>>>> ArrayView, but > > > > > > > > > >>>>>>>>>>>>> expensive to go the other way. > > > > > > > > > >>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>> Best, > > > > > > > > > >>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>> Will Jones > > > > > > > > > >>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe > Oliveira > > > > > > Carvalho > > > > > > > < > > > > > > > > > >>>>>>>>>>>>> felipe...@gmail.com<mailto: > felipe...@gmail.com > > >> > > > > > > wrote: > > > > > > > > > >>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>>> Hi folks, > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>>> I would like to start a public discussion on > > the > > > > > > > inclusion > > > > > > > > > >>> of a > > > > > > > > > >>>>>>>> new > > > > > > > > > >>>>>>>>>> array > > > > > > > > > >>>>>>>>>>>>>> format to Arrow — array-view array. The name > > is > > > > also > > > > > > up > > > > > > > > for > > > > > > > > > >>>>>>>> debate. > > > > > > > > > >>>>>>>>>>>>>> This format is inspired by Velox's > ArrayVector > > > > format > > > > > > > [1]. > > > > > > > > > >>>>>>>>> Logically, > > > > > > > > > >>>>>>>>>>>>> this > > > > > > > > > >>>>>>>>>>>>>> array represents an array of arrays. Each > > element > > > > is > > > > > > an > > > > > > > > > >>>>>>>> array-view > > > > > > > > > >>>>>>>>>>>>> (offset > > > > > > > > > >>>>>>>>>>>>>> and size pair) that points to a range > within a > > > > nested > > > > > > > > > >>> "values" > > > > > > > > > >>>>>>>>> array > > > > > > > > > >>>>>>>>>>>>>> (called "elements" in Velox docs). The > nested > > > > array > > > > > > can > > > > > > > be > > > > > > > > > of > > > > > > > > > >>>> any > > > > > > > > > >>>>>>>>>> type, > > > > > > > > > >>>>>>>>>>>>>> which makes this format very flexible and > > > > powerful. > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>>> [image: ../_images/array-vector.png] > > > > > > > > > >>>>>>>>>>>>>> < > > > > > > > > > >>>>>>>>> > > > > > > > > > >>> > > > > > > > https://facebookincubator.github.io/velox/_images/array-vector.png > > > > > > > < > > > > > > > > > >>>>>> > > > > > > > > > > https://facebookincubator.github.io/velox/_images/array-vector.png > > > > > > > > > >> > > > > > > > > > >>>>>>>>>>>>>> I'm currently working on a C++ > implementation > > and > > > > plan > > > > > > > to > > > > > > > > > >>> work > > > > > > > > > >>>>>>>> on a > > > > > > > > > >>>>>>>>>> Go > > > > > > > > > >>>>>>>>>>>>>> implementation to fulfill the > > two-implementations > > > > > > > > > requirement > > > > > > > > > >>>> for > > > > > > > > > >>>>>>>>>> format > > > > > > > > > >>>>>>>>>>>>>> changes. > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>>> The draft design: > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets > > > > buffer, > > > > > > > int32 > > > > > > > > > >>> sizes > > > > > > > > > >>>>>>>>>> buffer] > > > > > > > > > >>>>>>>>>>>>>> - 1 child array: "values" as an array of the > > type > > > > > > > > parameter > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>>> validity_bitmap is used to differentiate > > between > > > > empty > > > > > > > > array > > > > > > > > > >>>>>>>> views > > > > > > > > > >>>>>>>>>>>>>> (sizes[i] == 0) and NULL array views > > > > > > (validity_bitmap[i] > > > > > > > > == > > > > > > > > > >>> 0). > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>>> When the validity_bitmap[i] is 0, both sizes > > and > > > > > > offsets > > > > > > > > are > > > > > > > > > >>>>>>>>>> undefined > > > > > > > > > >>>>>>>>>>>>> (as > > > > > > > > > >>>>>>>>>>>>>> usual), and when sizes[i] == 0, offsets[i] > is > > > > > > > undefined. 0 > > > > > > > > > is > > > > > > > > > >>>>>>>>>> recommended > > > > > > > > > >>>>>>>>>>>>>> if setting a value is not an issue to the > > system > > > > > > > producing > > > > > > > > > >>> the > > > > > > > > > >>>>>>>>>> arrays. > > > > > > > > > >>>>>>>>>>>>>> offsets buffer is not required to be ordered > > and > > > > views > > > > > > > > don't > > > > > > > > > >>>> have > > > > > > > > > >>>>>>>>> to > > > > > > > > > >>>>>>>>>> be > > > > > > > > > >>>>>>>>>>>>>> disjoint. > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>>>>>>>>>> [1] > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>> > > > > > > > > > >>>> > > > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > > > > > > > > >>>>>> < > > > > > > > > > >>>>>> > > > > > > > > > >>>> > > > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>>>>>>>>> Thanks, > > > > > > > > > >>>>>>>>>>>>>> Felipe O. Carvalho > > > > > > > > > >>>>>>>>>>>>>> > > > > > > > > > >>>>>> > > > > > > > > > >>>> > > > > > > > > > >>>> > > > > > > > > > >>> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >