> Even if ListView is rarely used for interoperability (if it never gains wide adoption), some of the arrow implementations could use ListView to offer faster computation kernels, which I think has real value
This is an important point, thanks for the clear phrasing Andrew! On Thu, Jun 15, 2023 at 5:32 PM Felipe Oliveira Carvalho < felipe...@gmail.com> wrote: > On Wed, Jun 14, 2023 at 5:07 PM Raphael Taylor-Davies > <r.taylordav...@googlemail.com.invalid> wrote: > > > Even something relatively straightforward becomes a huge implementation > > effort when multiplied by a large number of codebases, users and > > datasets. Parquet is a great source of historical examples of the > > challenges of incremental changes that don't meaningfully unlock new > > use-cases. To take just one, Int96 was deprecated almost a decade ago, > > in favour of some additional metadata over an existing physical layout, > > and yet Int96 is still to the best of my knowledge used by Spark by > > default. > > > > That's not to say that I think the arrow specification should ossify and > > we should never change it, but I'm not hugely enthusiastic about adding > > encodings that are only incremental improvements over existing encodings. > > > > I therefore wonder if there are some new use-cases I am missing that > > would be unlocked by this change, and that wouldn't be supported by the > > dictionary proposal? Perhaps you could elaborate here? > > > The dict-encoded ListArray would add a memory indirection: the slot in the > dictionary has to be resolved before the offset can be. With ListView, the > CPU can access offset and size without waiting on that memory access. Both > need two buffer accesses after the validity check, but ListView doesn't > have that dependency. > > > > Whilst I do agree > > using dictionaries as proposed is perhaps a less elegant solution, I > > don't see anything inherently wrong with it, and if it ain't broke we > > really shouldn't be trying to fix it. > > > > Kind Regards, > > > > Raphael Taylor-Davies > > > > On 14 June 2023 17:52:52 BST, Felipe Oliveira Carvalho > > <felipe...@gmail.com> wrote: > > > > General approach to alternative formats aside, in the specific case > > of ListView, I think the implementation complexity is being > > overestimated in these discussions. The C++ Arrow implementation > > shares a lot of code between List and LargeList. And with some > > tweaks, I'm able to share that common infrastructure for ListView as > > well. [1] ListView is similar to list: it doesn't require offsets to > > be sorted and adds an extra buffer containing sizes. For symmetry > > with the List and LargeList types (FixedSizeList not included), I'm > > going to propose we add a LargeListView. That is not part of the > > draft implementation yet, but seems like an obvious thing to have > > now that I implemented the `if_else` specialization. [2] David Li > > asked about this above and I can confirm now that 64-bit version of > > ListView (LargeListView) is in the plans. Trying to avoid > > re-implementing some kernels is not a good goal to chase, IMO, > > because kernels need tweaks to take advantage of the format. [1] > > https://github.com/apache/arrow/pull/35345 [2] > > https://github.com/felipecrv/arrow/commits/list_view_backup -- > > Felipe On Wed, Jun 14, 2023 at 12:08 PM Weston Pace > > <weston.p...@gmail.com> wrote: > > > > perhaps we could support this use-case as a canonical > > extension type over dictionary encoded, variable-sized arrays > > > > I believe this suggestion is valid and could be used to solve > > the if-else case. The algorithm, if I understand it, would be > > roughly: ``` // Note: Simple pseudocode, vectorization left as > > exercise for the reader auto indices_builder = ... auto > > list_builder = ... indices_builder.resize(batch.length); Array > > condition_mask = condition.EvaluateBatch(batch); for row_index > > in selected_rows(condition_mask): indices_builder[row_index] = > > list_builder.CurrentLength(); > > list_builder.Append(if_expr.EvaluateRow(batch, row_index)) for > > row_index in unselected_rows(condition_mask): > > indices_builder[row_index] = list_builder.CurrentLength(); > > list_builder.Append(else_expr.EvaluateRow(batch, row_index)) > > return DictionaryArray(indices_builder.Finish(), > > list_builder.Finish()) ``` I also agree this is a slightly > > awkward use of dictionaries (e.g. the dictionary would have the > > same size as the # of indices) and perhaps not the most > > intuitive way to solve the problem. My gut reaction is simply > > "an improved if/else kernel is not, alone, enough justification > > for a new layout" and yet... I think we are seeing the start (I > > hope) of a trend where Arrow is not just used "between systems" > > (e.g. to shuttle data from one place to another, or between a > > query engine and a visualization tool) but also "within systems" > > (e.g. UDFs, bespoke file formats and temporary tables, between > > workers in a distributed query engine). When arrow is used > > "within systems" I think both the number of bespoke formats and > > the significance of conversion cost increases. For example, it's > > easy to say that Velox should convert at the boundary as data > > leaves Velox. But what if Velox (or datafusion or ...) were to > > define an interface for UDFs. Would we want to use Arrow there > > (e.g. the C data interface is a good fit)? If so, wouldn't the > > conversion cost be more significant? > > > > Also, I'm very lukewarm towards the concept of "alternative > > layouts" suggested somewhere else in this thread. It does > > not seem a good choice to complexify the Arrow format that > > much. > > > > I think, in my opinion, this depends on how many of these > > alternative layouts exist. If there are just a few, then I > > agree, we should just adopt them as formal first-class layouts. > > If there are many, then I think it will be too much complexity > > in Arrow to have all the different choices. Or, we could say > > there are many, but the alternatives don't belong in Arrow at > > all. In that case I think it's the same question as the above > > paragraph, "do we want Arrow to be used within systems? Or just > > between systems?" On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou > > <anto...@python.org> wrote: > > > > I agree that ListView cannot be an extension type, given > > that it features a new layout, and therefore cannot > > reasonably be backed by an existing storage type (AFAICT). > > Also, I'm very lukewarm towards the concept of "alternative > > layouts" suggested somewhere else in this thread. It does > > not seem a good choice to complexify the Arrow format that > > much. Regards Antoine. Le 07/06/2023 à 00:21, Felipe > > Oliveira Carvalho a écrit : > > > > +1 on what Ian said. And as I write kernels for this new > > format, I’m learning that it’s > > > > possible > > > > to re-use the common infrastructure used by List and > > LargeList to > > > > implement > > > > the ListView related features with some adjustments. IMO > > having this format as a second-class citizen would more > > likely complicate things because it would make this > > unification harder. — Felipe On Tue, 6 Jun 2023 at 18:45 > > Ian Cook <ianmc...@apache.org> wrote: > > > > To clarify why we cannot simply propose adding > > ListView as a new “canonical extension type”: The > > extension type mechanism in Arrow depends on the > > underlying data being organized in an existing Arrow > > layout—that way an implementation that does not > > support the extension type can still handle the > > underlying data. But ListView is a wholly new > > layout. I strongly agree with Weston’s idea that it > > is a good time for Arrow to introduce the notion of > > “canonical alternative layouts.” Taken together, I > > think that canonical extension types and canonical > > alternative layouts could serve as an “incubator” > > for proposed new representations. For example, if a > > proposed canonical alternative layout ends up being > > broadly adopted, then that will serve as a signal > > that we should consider adding it as a primary > > layout in the core spec. It seems to me that most > > projects that are implementing Arrow today are not > > aiming to provide complete coverage of Arrow; rather > > they are adopting Arrow because of its role as a > > standard and they are implementing only as much of > > the Arrow standard as they require to achieve some > > goal. I believe that such projects are important > > Arrow stakeholders, and I believe that this proposed > > notion of canonical alternative layouts will serve > > them well and will create efficiencies by > > standardizing implementations around a shared set of > > alternatives. However I think that the documentation > > for canonical alternative layouts should strongly > > encourage implementers to default to using the > > primary layouts defined in the core spec and only > > use alternative layouts in cases where the primary > > layouts do not meet their needs. On Sat, May 27, > > 2023 at 7:44 PM Micah Kornfield < > > > > emkornfi...@gmail.com> > > > > wrote: > > > > This sounds reasonable to me but my main concern > > is, I'm not sure > > > > there > > > > is > > > > a great mechanism to enforce canonical layouts > > don't somehow become > > > > default > > > > (or the only implementation). Even for these new > > layouts, I think it might be worth rethinking > > > > binding > > > > a > > > > layout into the schema versus having a different > > concept of encoding > > > > (and > > > > changing some of the corresponding data > > structures). On Mon, May 22, 2023 at 10:37 AM > > Weston Pace <weston.p...@gmail.com> > > > > wrote: > > > > Trying to settle on one option is a > > fruitless endeavor. Each type > > > > has > > > > pros > > > > and cons. I would also predict that the > > largest existing usage of > > > > Arrow is > > > > shuttling data from one system to another. > > The newly proposed > > > > format > > > > doesn't appear to have any significant > > advantage for that use case > > > > (if > > > > anything, the existing format is arguably > > better as it is more > > > > compact). > > > > I am very biased towards historical > > precedent and avoiding breaking changes. We > > have "canonical extension types", perhaps it > > is time for > > > > "canonical > > > > alternative layouts". We could define it as > > such: * There are one or more primary > > layouts * Existing layouts are automatically > > considered primary layouts, > > > > even if > > > > they wouldn't have been primary layouts > > initially (e.g. large list) * A new layout, > > if it is semantically equivalent to another, > > is > > > > considered > > > > an alternative layout * An alternative > > layout still has the same requirements for > > > > adoption > > > > (two > > > > implementations and a vote) * An > > implementation should not feel pressured to > > rush and > > > > implement > > > > the > > > > new layout. It would be good if they > > contribute in the discussion and > > > > consider the > > > > layout and vote if they feel it would be an > > acceptable design. * We can define and vote > > and approve as many canonical alternative > > > > layouts > > > > as we want: * A canonical alternative layout > > should, at a minimum, have some reasonable > > justification, such as improved performance > > for > > > > algorithm X > > > > * Arrow implementations MUST support the > > primary layouts * An Arrow implementation > > MAY support a canonical alternative, > > > > however: > > > > * An Arrow implementation MUST first support > > the primary layout * An Arrow implementation > > MUST support conversion to/from the > > > > primary > > > > and canonical layout * An Arrow > > implementation's APIs MUST only provide data > > in the alternative layout if it is > > explicitly asked for (e.g. schema inference > > > > should > > > > prefer the primary layout). * We can still > > vote for new primary layouts (e.g. promoting > a > > > > canonical > > > > alternative) but, in these votes we don't > > only consider the value (e.g. performance) of > > > > the > > > > layout > > > > but also the interoperability. In other > > words, a layout can only become a primary > > layout if > > > > there > > > > is > > > > significant evidence that most > > implementations plan to adopt it. This lets > > us evolve support for new layouts more > > naturally. We can generally assume that > > users will not, initially, be aware of these > > alternative layouts. However, everything > > will just work. They may > > > > start > > > > to see a performance penalty stemming from a > > lack of support for > > > > these > > > > layouts. If this performance penalty becomes > > significant then they > > > > will > > > > discover it and become aware of the problem. > > They can then ask > > > > whatever > > > > library they are using to add support for > > the alternative layout. > > > > As > > > > enough users find a need for it then > > libraries will add support. Eventually, > > enough libraries will support it that we can > > adopt it > > > > as a > > > > primary layout. Also, it allows libraries to > > adopt alternative layouts more > > > > aggressively if > > > > they would like while still hopefully > > ensuring that we eventually > > > > all > > > > converge on the same implementation of the > > alternative layout. On Mon, May 22, 2023 at > > 9:35 AM Will Jones <will.jones...@gmail.com > > > > wrote: > > > > Hello Arrow devs, I don't understand why > > we would start deprecating features in > the > > > > Arrow > > > > format. Even starting this talk > > might already be a bad idea > > > > PR-wise. > > > > I agree we don't want to make breaking > > changes to the Arrow format. > > > > But > > > > several maintainers have already stated > > they have no interest in maintaining > > both list types with full compute > > functionality [1][2], > > > > so I > > > > think it's very likely one list type or > > the other will be implicitly preferred > > in the ecosystem if this data type was > > added. > > > > If > > > > that's the case, I'd prefer that we > > agreed as a community which one > > > > should > > > > be preferred. Maybe that's not the best > > path; it's just one way for > > > > us to > > > > balance stability, maintenance burden, > > and relevance. Can someone help distill > > down the primary rationale and usecase > for > > > > adding ArrayView to the Arrow Spec? > > > > Looking back at that old thread, I think > > one of the main > > > > motivations > > > > is > > > > to > > > > try to prevent query engine implementers > > from feeling there is a > > > > tradeoff > > > > between having state-of-the-art > > performance and being Arrow-native. > > > > For > > > > some of the new array types, we had both > > Velox and DuckDB use them, > > > > so it > > > > was reasonable to expect they were > > innovations that might > > > > proliferate. > > > > I'm > > > > not sure if the ArrayView is part of > > that. From Wes earlier [3]: The idea is > > that in a world of data and query > > federation (for > > > > example, > > > > consider [1] where Arrow is being > > used as a data federation layer > > > > with > > > > many > > > > query engines), we want to increase > > the amount of data in-flight > > > > and > > > > in-memory that is in Arrow format. > > So if query engines are having > > > > to > > > > depart > > > > substantially from the Arrow format > > to get performance, then this > > > > creates a > > > > potential lose-lose situation: * > > Depart from Arrow: get better > > > > performance > > > > but pay serialization costs to read > > and write Arrow (the > > > > performance > > > > and > > > > resource utilization benefits > > outweigh the serialization costs). > > > > This > > > > puts > > > > additional pressure on query engines > > to build specialized > > > > components > > > > for > > > > solving problems rather than making > > use of off-the-shelf > > > > components > > > > that > > > > use Arrow. This has knock-on effects > > on ecosystem fragmentation. * > > > > Or > > > > use > > > > Arrow, and accept suboptimal query > > processing performance > > > > Will mentions one usecase is Velox > > consuming python UDF output, > > > > which > > > > seems > > > > to be mostly about how fast Velox > > can consume this format, not how > > > > fast > > > > it > > > > can be written. Are there other > > usecases? > > > > To be clear, I don't know if that's the > > use case they want. That's > > > > just > > > > me > > > > speculating. I still have some questions > > myself: 1. Is this array type currently > > only used in Velox? (not DuckDB > > > > like > > > > some > > > > of the other new types?) What evidence > > do we have that it will > > > > become > > > > used > > > > outside of Velox? 2. We already have > > three list types: list, large list > (64-bit > > > > offsets), > > > > and > > > > fixed size list. Do we think we will > > only want a view version of > > > > the > > > > 32-bit > > > > offset variable length list? Or are we > > potentially talking about > > > > view > > > > variants for all three? Best, Will Jones > > [1] > > > > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > > > > [2] > > > > https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3 > > > > [3] > > > > https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t > > > > On Mon, May 22, 2023 at 3:48 AM Andrew > > Lamb <al...@influxdata.com> > > > > wrote: > > > > Can someone help distill down the > > primary rationale and usecase > > > > for > > > > adding ArrayView to the Arrow Spec? > > From the above discussions, the > > stated rationale seems to be fast > > (zero-copy) interchange with Velox. > > This thread has qualitatively > > enumerated the benefits of > > > > (offset+len) > > > > encoding over the existing Arrow > > ListArray (offets) approach, but > > > > I > > > > haven't > > > > seen any performance measurements > > that might help us to gauge the > > > > tradeoff > > > > in additional complexity vs runtime > > overhead. Will mentions one usecase > > is Velox consuming python UDF output, > > > > which > > > > seems > > > > to be mostly about how fast Velox > > can consume this format, not how > > > > fast > > > > it > > > > can be written. Are there other > > usecases? Do we have numbers showing > > how much overhead converting to /from > > > > Velox's > > > > internal representation and the > > existing ListArray adds? Has > > > > anyone in > > > > Velox land considered adding faster > > support for Arrow style > > > > ListArray > > > > encoding? Andrew On Mon, May 22, > > 2023 at 4:38 AM Antoine Pitrou < > > > > anto...@python.org > > > > wrote: > > > > Hi, I don't understand why we > > would start deprecating features > > in the > > > > Arrow > > > > format. Even starting this talk > > might already be a bad idea > > > > PR-wise. > > > > As for implementing conversions > > at the I/O boundary, it's a > > > > reasonably > > > > policy, but it still requires > > work by implementors and it's not > > > > granted > > > > that all consumers of the Arrow > > format will grow such > > conversions if/when we add > > non-trivial types such as > > ListView or StringView. Regards > > Antoine. Le 22/05/2023 à 00:39, > > Will Jones a écrit : > > > > One more thing: Looking back > > on the previous discussion[1] > > > > (which > > > > Weston > > > > pointed out in his earlier > > message), Jorge suggested > > that the > > > > old > > > > list > > > > types might be deprecated in > > favor of view variants [2]. > > Others > > > > were > > > > worried that it might > > undermine the perception > > that the Arrow > > > > format > > > > is > > > > stable. I think it might be > > worth thinking about "soft > > > > deprecating" > > > > the > > > > old > > > > list type: suggesting new > > implementations prefer the > > list > > > > view, but > > > > reassuring that > > implementations should > > support the old format, > > > > even > > > > if > > > > they > > > > just convert to the new > > format. To be clear, this > > wouldn't > > > > mean we > > > > plan > > > > to > > > > create breaking changes in > > the format; but if we ever > > did for > > > > other > > > > reasons, the old list type > > might go. Arrow compute > > libraries could choose > > either format for compute > > > > support, > > > > and > > > > plan to do conversion at the > > boundaries. Libraries that > use > > > > the new > > > > type > > > > will have cheap conversion > > when reading the old type. > > Meanwhile > > > > those > > > > that > > > > are still on the old type > > will have some incentive to > > move > > > > towards > > > > the > > > > new > > > > one, since that conversion > > will not be as efficient. [1] > > > > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > > > > [2] > > > > > > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > > > > > > On Sun, May 21, 2023 at > > 3:07 PM Will Jones < > > > > will.jones...@gmail.com> > > > > wrote: > > > > Hello, I think Sasha > > brings up a good point, > > that the advantages of > > > > this > > > > format > > > > seem to be primarily > > about query processing. > > Other encodings > > > > like > > > > REE > > > > and > > > > dictionary have > > space-saving advantages > > that justify them > > > > simply > > > > in > > > > terms > > > > of space efficiency > > (although they have > > query processing > > > > advantages > > > > as > > > > well). As discussed, > > most use cases are > > already well served by > > > > existing > > > > list types and > > dictionary encoding. I > > agree that there are > > cases where transferring > > this type > > > > without > > > > conversion would be > > ideal. One use case I > > can think of is if > > > > Velox > > > > wants to > > > > be able to take > > Arrow-based UDFs > > (possibly written with > > > > PyArrow, > > > > for > > > > example) that operate on > > this column type and > > therefore wants > > > > zero-copy > > > > exchange over the C Data > > Interface. One big > > question I have: we > > already have three list > > types: > > > > list, > > > > large > > > > list (64-bit offsets), > > and fixed size list. Do > > we think we > > > > will > > > > only > > > > want a > > > > view version of the > > 32-bit offset variable > > length list? Or > > > > are we > > > > potentially talking > > about view variants for > > all three? Best, Will > > Jones On Sun, May 21, > > 2023 at 2:19 PM Felipe > > Oliveira Carvalho < > > felipe...@gmail.com> > > wrote: > > > > The benefit of > > having a memory > > format that’s > > friendly to > > > > non-deterministic > > > > order writes is > > unlocked by the > > transport and > > processing of > > > > the > > > > data > > > > being > > > > agnostic to the > > physical order as > > much as possible. > > Requiring a > > conversion could > > cancel out that > > benefit. But it > > > > can > > > > be a > > > > provisory step for > > compatibility > > between systems that > > don’t > > > > understand > > > > the > > > > format yet. This is > > similar to the > > situation with > > compression > > > > schemes > > > > like > > > > run-end encoding — > > the goal is > > processing the > > compressed data > > > > directly > > > > without an expansion > > step whenever > > possible. This is > > why having it as > > part of the open > > Arrow format is so > > > > important: > > > > everyone can agree > > on a format that’s > > friendly to parallel > > > > and/or > > > > vectorized compute > > kernels without > > introducing multiple > > > > incompatible > > > > formats to the > > ecosystem and > > without imposing a > > conversion > > > > step > > > > between > > > > the different > > systems. — Felipe On > > Sat, 20 May 2023 at > > 20:04 Aldrin > > > > <octalene....@pm.me.invalid> > > > > wrote: > > > > I don't feel > > like this > > representation > > is necessarily a > > > > detail of > > > > the > > > > query > > > > engine, but I am > > also not sure > > why this > > representation > > would > > > > have > > > > to > > > > be > > > > converted to a > > non-view format > > when > > serializing. > > Could you > > > > clarify > > > > that? My > > > > impression is > > that this > > representation > > could be used for > > > > persistence > > > > or > > > > data transfer, > > though it can be > > more complex to > > guarantee > > > > the > > > > portion > > > > of > > > > the buffer that > > an index points > > to is also > > present in > > > > memory. > > > > Sent from Proton > > Mail for iOS On > > Sat, May 20, > > 2023 at 15:00, > > Sasha Krassovsky > < > > > > > > krassovskysa...@gmail.com > > > > > > > > <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>> > > > > > > wrote: > > > > Hi everyone, I > > understand that > > there are > > numerous > > benefits to this > > > > representation > > > > during query > > processing, but > > would it be fair > > to say that > > > > this > > > > is > > > > an > > > > implementation > > detail of the > > query engine? > > Query engines > > > > don’t > > > > necessarily > > > > need to conform > > to the Arrow > > format > > internally, only > > at > > > > ingest/egress > > > > points, and > > performing a > > conversion from > > the non-view to > > > > view > > > > format > > > > seems > > > > like it would be > > very cheap > > (though I > > understand not > > > > necessarily > > > > the > > > > other > > > > way around, but > > you’d need to do > > that anyway if > > you’re > > > > serializing). > > > > Sasha Krassovsky > > > > 20 мая 2023 > > г., в 13:00, > > Will Jones < > > > > will.jones...@gmail.com> > > > > написал(а): > > > > Thanks for > > sharing > > these > > details, > > Pedro. The > > conditional > > > > branches > > > > argument > > > > makes a lot > > of sense to > > me. The > > tensors > > point brings > > up some > > interesting > > issues. For > > > > now, > > > > we've > > > > defined > > > > our only > > tensor > > extension > > type to be > > built on a > > fixed size > > > > list. > > > > If a > > > > use > > > > case of this > > might be > > manipulating > > tensors with > > zero copy, > > > > perhaps > > > > that > > > > suggests > > that we want > > a fixed size > > list > > variant? In > > > > addition, > > > > would > > > > we > > > > have > > > > to define > > another > > extension > > type to be a > > ListView > > variant? > > > > Or > > > > would > > > > we > > > > want > > > > to think > > about making > > extension > > types > > somehow > > valid across > > > > various > > > > encodings of > > the same > > "logical > type"? > > > > On Fri, > > May 19, > > 2023 at > > 1:59 PM > > Pedro > > Eugenio > > Rocha > > > > Pedreira > > > > > > <pedro...@meta.com.invalid> > > wrote: > > Hi all, > > This is > > Pedro > > from the > > Velox > > team at > > Meta. > > This is > my > > > > first > > > > time > > > > here, > > > > so > > > > nice to > > e-meet > > you! > > Adding > > to what > > Felipe > > said, > > the main > > reason > > we > created > > > > “ListView” > > > > (though > > > > we just > > call > > them > > > > ArrayVector/MapVector > > in > > Velox) > > is that, > > > > along > > > > with > > > > > StringViews > > for > > strings, > > they > > allow us > > to write > > any > > > > columnar > > > > buffer > > > > > > out-or-order, > > > regardless > > of their > > types or > > > encodings. > > > > This is > > > > naturally > > > > doable > > for all > > primitive > > types > > > > (fixed-size), > > but not > > for > > > > types > > > > that > > > > don’t > > > > have > > fixed > > size and > > are > > required > > to be > > > contiguous. > > The > > > > StringView > > > > and > > > > ListView > > formats > > allow us > > to keep > > this > > invariant > > in > > Velox. > > Being > > able to > > write > > vectors > > > > out-of-order > > is > > useful > > when > > > > executing > > > > > > conditionals > > like > > IF/SWITCH > > > statements, > > which are > > > > pervasive > > > > among > > > > our > > > > > workloads. > > To fully > > vectorize > > it, one > > first > > evaluates > > the > > > > expression, > > > > then > > > > generate > > a bitmap > > > containing > > which > > rows > > take the > > THEN and > > > > which > > > > take > > > > the > > > > ELSE > > branch. > > Then you > > populate > > all rows > > that > > match the > > > > first > > > > branch > > > > by > > > > > evaluating > > the THEN > > > expression > > in a > > > vectorized > > > > (branch-less > > > > and > > > > cache > > > > friendly) > > way, and > > > > subsequently > > the ELSE > > branch. > > If you > > > > can’t > > > > write > > > > them > > > > > > out-of-order, > > you > > would > > either > > have a > > big > > branch > > per row > > > > dispatching > > > > to > > > > the > > > > right > > > expression > > (slow), > > or > > populate > > two > > distinct > > vectors > > > > then > > > > merging > > > > them > > > > at the > > end > > > > (potentially > > even > > slower). > > How much > > faster > our > > > > approach > > > > is > > > > highly > > depends > > on the > > buffer > > sizes > > and > > > > expressions, > > but we > > > > found > > > > it > > > > to > > > > be > > > > faster > > enough > > on > > average > > to > > justify > > us > > extending > > the > > > > underlying > > > > layout. > > > > With > > that > > said, > > this is > > all > > within a > > single > > thread of > > > > execution. > > > > > > Parallelization > > is done > > by > > feeding > > each > > thread > > its own > > > > vector/data. > > > > As > > > > pointed > > out in a > > previous > > message, > > this > > also > > gives > > you the > > > > flexibility > > > > to > > > > implement > > > cardinality > > > > increasing/reducing > > > operations, > > but > > > > we > > > > don’t > > > > use > > > > it > > > > for that > > purpose. > > > Operations > > like > > > filtering, > > joining, > > > > unnesting > > > > and > > > > similar > > > > are done > > by > > wrapping > > the > > internal > > vector > > in a > > > > dictionary, > > > > as > > > > these > > > > need > > > > to > > > > work not > > only on > > > “ListViews” > > but on > > any data > > types > with > > > > any > > > > encoding. > > > > There > > > > are more > > details > > on > > Section > > 4.2.1 in > > [1] > > Beyond > > this, it > > also > > gives > > > > function/kernel > > > developers > > more > > > > flexibility > > > > to > > > > implement > > > operations > > that > > > manipulate > > > > Arrays/Maps. > > For > > > > example, > > > > operations > > > > that > > slice > > these > > > containers > > can be > > > implemented > > in a > > > > zero-copy > > > > manner > > > > by > > > > just > > > rearranging > > the > > > > lengths/offsets > > indices, > > without > > ever > > > > touching > > > > the > > > > larger > > internal > > buffers. > > This is > > a > > similar > > > motivation > > as > > > > for > > > > StringView > > > > (think > > of > > substr(), > > trim(), > > and > > similar). > > One nice > > last > > > > property > > > > is > > > > that > > > > this > > layout > > allows > > for > > > overlapping > > ranges. > > This is > > > > something > > > > discussed > > > > with > > > > our ML > > people > > to allow > > deduping > > feature > > values > > in a > > tensor > > > > (which > > > > is > > > > fairly > > > > common), > > but not > > something > > we have > > leveraged > > just > > yet. [1] > > - > > > > https://vldb.org/pvldb/vol15/p3372-pedreira.pdf > > Best, -- > > Pedro > > Pedreira > > > > ------------------------------------------------------------------------ > > From: > > Felipe > > Oliveira > > Carvalho > > < > > felipe...@gmail.com> > > Sent: > > Friday, > > May 19, > > 2023 > > 10:01 AM > > To: > > > > dev@arrow.apache.org > > < > > dev@arrow.apache.org> > > Cc: > > Pedro > > Eugenio > > Rocha > > Pedreira > > < > > pedro...@meta.com> > > Subject: > > Re: > > > > [DISCUSS][Format] > > Starting > > the draft > > > > implementation > > > > of > > > > the > > > > ArrayView > > array > > format > > +pedroerp > > On Thu, > > 11 May > > 2023 at > > 17: 51 > > Raphael > > > > Taylor-Davies > > > > <r. > > > > > > taylordavies@ > > > googlemail. > > com. > > invalid> > > wrote: > > Hi All, > > > > > if > > > > we > > > > added > > > > this, do > > we think > > many > > Arrow > > and > > query > > > engine > > > > implementations > > > > (for > > > > example, > > > DataFusion) > > will be > > > > ZjQcmQRYFpfptBannerStart > > This > > Message > > Is From > > an > > External > > Sender > > > > ZjQcmQRYFpfptBannerEnd > > +pedroerp > > On Thu, > > 11 May > > 2023 at > > 17:51 > > Raphael > > > > Taylor-Davies > > > > <r.taylordav...@googlemail.com.invalid> > > wrote: > > Hi All, > > > > if > > we > > added > > this, > > do > > we > > think > > many > > Arrow > > and > > query > > > engine > > > > implementations > > (for > > > > example, > > > > DataFusion) > > will > > be > > > > eager > > > > to > > > > add > > > > full > > > > > support > > for > > the > > type, > > > > including > > > compute > > > > kernels? > > Or > are > > > > they > > > > likely > > > > to > > > > just > > > > > convert > > this > > type > > to > > > > ListArray > > at > > > import > > > > boundaries? > > > > > > I can't > > speak > > for > > query > > engines > > in > > general, > > but at > > least > > > > for > > > > arrow-rs > > > > and by > > extension > > > DataFusion, > > and > > based on > > my > current > > > > understanding > > > > of > > > > the > > use-cases > > I would > > be > > rather > > hesitant > > to add > > support > > > > to the > > > > kernels > > > > for this > > array > > type, > > > definitely > > instead > > favouring > > > > conversion > > > > at > > > > the > > > > edges. > > We > > already > > have > > issues > > with the > > amount > > of code > > > > generation > > > > resulting > > in > > binary > > bloat > > and long > > compile > > times, > > and I > > > > worry > > > > this > > > > would > > > > worsen > > this > > situation > > whilst > > not > > really > > providing > > > > compelling > > > > advantages > > > > for the > > vast > > majority > > of > > workloads > > that > > don't > > interact > > > > with > > > > Velox. > > > > Whilst I > > can > > > definitely > > see that > > the > > ListView > > > > representation > > > > is > > > > probably > > > > a better > > way to > > represent > > variable > > length > > lists > > than what > > > > arrow > > > > settled > > > > upon, > > I'm not > > yet > > convinced > > it is > > > > sufficiently > > better to > > > > incentivise > > > > broad > > ecosystem > > adoption. > > Kind > > Regards, > > Raphael > > > > Taylor-Davies > > > > > > On > > > > 11/05/2023 > > > 21:20, > > Will > > Jones > > > wrote: > > Hi > > > Felipe, > > > Thanks > > for > > the > > > > additional > > > > details. > > > > > > > > Velox > > > > kernels > > > > benefit > > > > from > > > > being > > > > able > > > to > > > > append > > > > data > > > to > > > > the > > > > array > > > > from > > > > > > different > > > > threads > > > > without > > > > care > > > for > > > > strict > > > > ordering. > > > > > > Only the > > > > offsets > > > > array > > > > > has > > > to > > > be > > > > written > > > > according > > > to > > > > logical > > > > order > > > but > > > > that > > > is > > > > potentially a > > > > much > > > > > > smaller > > > > buffer > > > > than > > > the > > > > values > > > > buffer. > > > > > > It > > still > > seems > > to > > me > > like > > > > applications > > are > > still > > > pretty > > > > niche, > > > > as I > > > > suspect > > > > in > > most > > cases > > the > > > > benefits > > are > > > > outweighed > > by > > the > > > costs. > > > > The > > > > benefit > > > > here > > > > seems > > > pretty > > > > limited: > > if > > you > > are > > > trying > > to > > split > > work > > > > between > > > > threads, > > > > > usually > > you > > will > > have > > other > > > levels > > such > > as > > array > > > chunks > > > > to > > > > parallelize. > > > > And > > > > if > > you > > have > > an > > > > incoming > > > stream > > of > > row > > data, > > > you'll > > want > > > > to > > > > append > > > > in > > > > > > predictable > > order > > to > > match > > the > > order > > of > > the > > other > > > > arrays. Am > > > > I > > > > missing > > > > > > something? > > And, > > IIUC, > > the > > cost > > of > > using > > > > ListView > > with > > > > out-of-order > > > > > > values > > > > over > > > > > > ListArray > > is > > you > > lose > > > memory > > > > locality; > > the > > > values > > of > > > > element > > > > 2 > > > > are > > > > no > > > > > longer > > > > adjacent > > to > > the > > > values > > of > > > element > > 1. > > What > > do > you > > > > think > > > > about > > > > that > > > > > > tradeoff? > > I > > don't > > mean > > to > > be > > > > difficult > > about > > this. > > I'm > > > excited > > for > > > > both > > > > the > > > > REE > > > > and > > > > > > StringView > > > arrays, > > but > > this > > one > > I'm > > not > > so > > sure > > about > > > > yet. I > > > > suppose > > > > what I > > > > am > > > trying > > to > > ask > > is, > > if > > we > > added > > this, > > do > > we > > think > > many > > > > Arrow > > > > and > > > > query > > > > > engine > > > > implementations > > (for > > > > example, > > > > DataFusion) > > will > > be > > > > eager > > > > to > > > > add > > > > full > > > > > support > > for > > the > > type, > > > > including > > > compute > > > > kernels? > > Or > are > > > > they > > > > likely > > > > to > > > > just > > > > > convert > > this > > type > > to > > > > ListArray > > at > > > import > > > > boundaries? > > > Because > > if > > it > > turns > > out > > to > > be > > the > > > latter, > > then > > we > > might > > > > as > > > > well > > > > ask > > > > Velox > > > > to > > > export > > this > > type > > as > > > > ListArray > > and > > save > > the > > rest > > of > the > > > > ecosystem > > > > some > > > > work. > > Best, > > Will > > Jones > > On > > Thu, > > May > > 11, > > 2023 > > at > > > > 12:32 PM > > > Felipe > > > > Oliveira > > > > > > Carvalho < > > > > > > felipe...@gmail.com<mailto:felipe...@gmail.com > > > > <mailto:felipe...@gmail.com>>> > > > wrote: > > > > > > Initial > > > > reason > > > for > > > > ListView > > > > arrays > > > in > > > > Arrow > > > is > > > > zero-copy > > > > > > compatibility > > > > with > > > > > > Velox > > > > which > > > > uses > > > > this > > > > format. > > > > Velox > > > > kernels > > > > benefit > > > > from > > > > being > > > > able > > > to > > > > append > > > > data > > > to > > > > the > > > > array > > > > from > > > > > > different > > > > threads > > > > without > > > > care > > > for > > > > strict > > > > ordering. > > > > > > Only the > > > > offsets > > > > array > > > > > has > > > to > > > be > > > > written > > > > according > > > to > > > > logical > > > > order > > > but > > > > that > > > is > > > > potentially a > > > > much > > > > > > smaller > > > > buffer > > > > than > > > the > > > > values > > > > buffer. > > > > Acero > > > > kernels > > > > could > > > > take > > > > advantage > > > of > > > > that > > > in > > > > the > > > > future. > > > > > In > > > > implementing > > > > ListViewArray/Type > > I > > > was > > > > able > > > to > > > > reuse > > > > > > some > > > > C++ > > > > templates > > > > > > used > > > for > > > > ListArray > > > > which > > > can > > > > reduce > > > > some > > > of > > > the > > > > burden > > > > > > on > > > > kernel > > > > > > implementations > > > > that > > > aim > > > to > > > > work > > > > with > > > all > > > the > > > > types. > > > I’m > > > can > > > fix > > > > Acero > > > > kernels > > > for > > > > working > > > > with > > > > ListView. > > > > > > This is > > > > similar > > > > to > > > > the > > > > > > work > > > > I’ve > > > > doing > > > in > > > > kernels > > > > dealing > > > > with > > > > run-end > > > > encoded > > > > > > arrays. > > > > — > > > > Felipe > > > On > > > > Wed, > > > 26 > > > Apr > > > > 2023 > > > at > > > > 01:03 > > > > Will > > > > Jones > > < > > > > will.jones...@gmail.com > > > > <mailto: > > will.jones...@gmail.com > > <mailto: > > will.jones...@gmail.com>>> > > wrote: > > > > > > I > > > > suppose > > > > one > > > > common > > > > use > > > > case > > > > is > > > > materializing > > > > list > > > > > > columns > > > > after > > > > some > > > > > > expanding > > > > operation > > > > like > > > > a > > > > join > > > > or > > > > unnest. > > > > That's > > > > a > > > > > > case > > > > where > > > > I > > > > could > > > > > > imagine > > > > a > > > > lot > > > > of > > > > repetition > > > > of > > > > values. > > > > Haven't > > > > yet > > > > > > thought > > > > of > > > > common > > > > > > cases > > > > > > > > where > > > > there > > > > is > > > > overlap > > > > but > > > > not > > > > full > > > > duplication, > > > > but > > > > am > > > > > > eager > > > > to > > > > hear > > > > > > any. > > > > > > > > The > > > > dictionary > > > > encoding > > > > point > > > > Raphael > > > > makes > > > > is > > > > > > interesting, > > > > especially > > > > > > given > > > > the > > > > existence > > > > of > > > > LargeList > > > > and > > > > FixedSizeList. > > > > For > > > > > > many > > > > > operations, > > > > > it > > > > > > might > > > > make > > > > more > > > > sense > > > > to > > > > just > > > > compose > > > > those > > > > existing > > > > > > types. > > > > > > IIUC > > > > the > > > > operations > > > > that > > > > would > > > > be > > > > unique > > > > to > > > > the > > > > > > ArrayView > > > > are > > > > ones > > > > > > altering > > > > > > > > the > > > > shape. > > > > One > > > > could > > > > truncate > > > > each > > > > array > > > > to > > > > a > > > > certain > > > > > > length > > > > cheaply > > > > > > simply > > > > > > > > by > > > > replacing > > > > the > > > > sizes > > > > buffer. > > > > Or > > > > perhaps > > > > there > > > > are > > > > > > interesting > > > > > > operations > > > > > > > > on > > > > tensors > > > > that > > > > would > > > > benefit. > > > > On > > > > Tue, > > > > Apr > > > > 25, > > > > 2023 > > > > at > > > > 7:47 PM > > > > Raphael > > > > Taylor-Davies > > > > <r.taylordav...@googlemail.com.invalid> > > > > wrote: > > > > > > > > Unless > > > > I > > > > am > > > > missing > > > > something, > > > > I > > > > think > > > > the > > > > selection > > > > > > use-case > > > > could > > > > be > > > > > > equally > > > > well > > > > served > > > > by > > > > a > > > > dictionary-encoded > > > > > > BinarArray/ListArray, > > > > and > > > > > > would > > > > > > > > have > > > > the > > > > benefit > > > > of > > > > not > > > > requiring > > > > any > > > > modifications > > > > > > to the > > > > existing > > > > > > format > > > > > > > > or > > > > kernels. > > > > The > > > > major > > > > additional > > > > flexibility > > > > of > > > > the > > > > proposed > > > > > > encoding > > > > would > > > > be > > > > > > permitting > > > > disjoint > > > > or > > > > overlapping > > > > ranges, > > > > are > > > > these > > > > > > common > > > > enough > > > > in > > > > > > practice > > > > to > > > > represent > > > > a > > > > meaningful > > > > bottleneck? > > > > On > > > > 26 > > > > April > > > > 2023 > > > > 01:40:14 > > > > BST, > > > > David > > > > Li > > > > < > > > > > > lidav...@apache.org > > > > <mailto: > <mailto:> > > > > > > lidav...@apache.org>> > > wrote: > > > > > > Is > > > > there > > > > a > > > > need > > > > for > > > > a > > > > 64-bit > > > > offsets > > > > version > > > > the > > > > > > same way > > > > we > > > > have > > > > List > > > > > > and > > > > LargeList? > > > > > > > > And > > > > just > > > > to > > > > be > > > > clear, > > > > the > > > > difference > > > > with > > > > List > > > > is > > > > > > that > > > > the > > > > lists > > > > don't > > > > > > have > > > > to > > > > be > > > > stored > > > > in > > > > their > > > > logical > > > > order > > > > (or > > > > in > > > > other > > > > > > words, > > > > offsets > > > > do > > > > > > not > > > > > > > > have > > > > to > > > > be > > > > nondecreasing > > > > and > > > > so > > > > we > > > > also > > > > need > > > > sizes)? > > > > > > > > On > > > > Wed, > > > > Apr > > > > 26, > > > > 2023, > > > > at > > > > 09:37, > > > > Weston > > > > Pace > > > > wrote: > > > > > > > > For > > > > context, > > > > there > > > > was > > > > some > > > > discussion > > > > on > > > > this > > > > back > > > > > > in > > > > [1]. > > > > At > > > > that > > > > > > time > > > > > > > > this > > > > was > > > > called > > > > "sequence > > > > view" > > > > but > > > > I > > > > do > > > > not > > > > like > > > > > > that > > > > name. > > > > > > However, > > > > > > > > array-view > > > > array > > > > is > > > > a > > > > little > > > > confusing. > > > > Given > > > > this > > > > > > is > > > > similar > > > > to > > > > > > list > > > > > > > > can > > > > > > > > we > > > > go > > > > with > > > > list-view > > > > array? > > > > > > > > Thanks > > > > for > > > > the > > > > introduction. > > > > I'd > > > > be > > > > interested > > > > to > > > > > > hear > > > > about > > > > the > > > > > > applications > > > > Velox > > > > has > > > > found > > > > for > > > > these > > > > vectors, > > > > > > and in > > > > what > > > > > > situations > > > > > > > > they > > > > > > > > are > > > > useful. > > > > This > > > > could > > > > be > > > > contrasted > > > > with > > > > the > > > > > > current > > > > ListArray > > > > > > implementations. > > > > > > > > I > > > > believe > > > > one > > > > significant > > > > benefit > > > > is > > > > that > > > > take > > > > (and > > > > > > by > > > > proxy, > > > > > > filter) > > > > > > > > and > > > > > > > > sort > > > > are > > > > O(# > > > > of > > > > items) > > > > with > > > > the > > > > proposed > > > > format > > > > and > > > > > > O(# > > > > of > > > > bytes) > > > > > > with > > > > > > > > the > > > > > > > > current > > > > format. > > > > Jorge > > > > did > > > > some > > > > profiling > > > > to > > > > this > > > > > > effect > > > > in > > > > [1]. > > > > > > [1] > > > > > > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq< > > > > > > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq> > > > > > > > > On > > > > Tue, > > > > Apr > > > > 25, > > > > 2023 > > > > at > > > > 3:13 PM > > > > Will > > > > Jones > > > > < > > > > > > > > will.jones...@gmail.com > > > > > > <mailto: > > will.jones...@gmail.com > > <mailto: > > will.jones...@gmail.com>> > > > > > > > > wrote: > > > > > > > > Hi > > > > Felipe, > > > > Thanks > > > > for > > > > the > > > > introduction. > > > > I'd > > > > be > > > > interested > > > > to > > > > > > hear > > > > about > > > > the > > > > > > applications > > > > Velox > > > > has > > > > found > > > > for > > > > these > > > > vectors, > > > > > > and in > > > > what > > > > > > situations > > > > > > > > they > > > > > > > > are > > > > useful. > > > > This > > > > could > > > > be > > > > contrasted > > > > with > > > > the > > > > > > current > > > > ListArray > > > > > > implementations. > > > > IIUC > > > > it > > > > would > > > > be > > > > fairly > > > > cheap > > > > to > > > > transform > > > > a > > > > > > ListArray > > > > to > > > > an > > > > > > ArrayView, > > > > but > > > > > > > > expensive > > > > to > > > > go > > > > the > > > > other > > > > way. > > > > Best, > > > > Will > > > > Jones > > > > On > > > > Tue, > > > > Apr > > > > 25, > > > > 2023 > > > > at > > > > 3:00 PM > > > > Felipe > > > > Oliveira > > > > > > Carvalho > > > > < > > > > > > felipe...@gmail.com<mailto:felipe...@gmail.com > > > > <mailto:felipe...@gmail.com>>> > > > > > > wrote: > > > > > > Hi > > > > folks, > > > > I > > > > would > > > > like > > > > to > > > > start > > > > a > > > > public > > > > discussion > > > > on > > > > the > > > > > > inclusion > > > > of a > > > > > new > > > > > > array > > > > > > > > format > > > > to > > > > Arrow > > > > — > > > > array-view > > > > array. > > > > The > > > > name > > > > is > > > > > > also > > > > up > > > > for > > > > > > debate. > > > > > > > > This > > > > format > > > > is > > > > inspired > > > > by > > > > Velox's > > > > ArrayVector > > > > > > format > > > > [1]. > > > > > > Logically, > > > > > > > > this > > > > > > > > array > > > > represents > > > > an > > > > array > > > > of > > > > arrays. > > > > Each > > > > element > > > > > > is > > > > an > > > > > > array-view > > > > > > > > (offset > > > > > > > > and > > > > size > > > > pair) > > > > that > > > > points > > > > to > > > > a > > > > range > > > > within > > > > a > > > > > > nested > > > > "values" > > > > > > array > > > > > > > > (called > > > > "elements" > > > > in > > > > Velox > > > > docs). > > > > The > > > > nested > > > > > > array > > > > can > > > > be > > > > of > > > > any > > > > > > type, > > > > > > > > which > > > > makes > > > > this > > > > format > > > > very > > > > flexible > > > > and > > > > > > powerful. > > > > > > [image: > > > > ../_images/array-vector.png] > > > > < > > > > > > > > https://facebookincubator.github.io/velox/_images/array-vector.png > > > > > > < > > > > > https://facebookincubator.github.io/velox/_images/array-vector.png > > > > > > I'm > > > > currently > > > > working > > > > on > > > > a > > > > C++ > > > > implementation > > > > and > > > > > > plan > > > > to > > > > work > > > > > on > > a > > > > > > Go > > > > > > > > implementation > > > > to > > > > fulfill > > > > the > > > > two-implementations > > > > > > requirement > > > > for > > > > > > format > > > > > > > > changes. > > > > The > > > > draft > > > > design: > > > > - > > > > 3 > > > > buffers: > > > > [validity_bitmap, > > > > int32 > > > > offsets > > > > > > buffer, > > > > int32 > > > > sizes > > > > > > buffer] > > > > > > > > - > > > > 1 > > > > child > > > > array: > > > > "values" > > > > as > > > > an > > > > array > > > > of > > > > the > > > > type > > > > > > parameter > > > > > > validity_bitmap > > > > is > > > > used > > > > to > > > > differentiate > > > > between > > > > > > empty > > > > array > > > > > > views > > > > > > > > (sizes[i] > > > > == > > > > 0) > > > > and > > > > NULL > > > > array > > > > views > > > > > > (validity_bitmap[i] > > > > == > > > > 0). > > > > > > When > > > > the > > > > validity_bitmap[i] > > > > is > > > > 0, > > > > both > > > > sizes > > > > and > > > > > > offsets > > > > are > > > > > > undefined > > > > > > > > (as > > > > > > > > usual), > > > > and > > > > when > > > > sizes[i] > > > > == > > > > 0, > > > > offsets[i] > > > > is > > > > > > undefined. 0 > > > > is > > > > > > recommended > > > > > > > > if > > > > setting > > > > a > > > > value > > > > is > > > > not > > > > an > > > > issue > > > > to > > > > the > > > > system > > > > > > producing > > > > the > > > > > > arrays. > > > > > > > > offsets > > > > buffer > > > > is > > > > not > > > > required > > > > to > > > > be > > > > ordered > > > > and > > > > > > views > > > > don't > > > > have > > > > > > to > > > > > > > > be > > > > > > > > disjoint. > > > > [1] > > > > > > > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > > > > > < > > > > > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > > > > > > > Thanks, > > > > Felipe > > > > O. > > > > Carvalho > > > > >