For what it's worth, my company is building a database using arrow(rs) as an in memory storage format, and this feature would be very helpful because it would allow us to bitmask out mvcc rows that have been deleted / have not yet been committed / have been rolled back, etc.
- Brent On Mon, May 15, 2023, 06:55 Ian Cook <i...@ursacomputing.com> wrote: > I think it would be easier for us all to weigh the costs and benefits > of adding this proposed ListView layout to the Arrow specification and > implementing it in the various Arrow libraries if we could all see > some benchmarks demonstrating the performance/efficiency benefits > compared to Arrow’s existing List layouts. > > Based on the Velox paper [1] and from conversations with the Velox > developers, I would anticipate that these benchmarks will show that > ListView confers substantial performance/efficiency benefits on some > workloads. I suggest conferring with the Velox developers to identify > benchmark workloads will best demonstrate the performance/efficiency > benefit of the ListView layout while representing common real-world > workloads. > > Ian > > [1] https://vldb.org/pvldb/vol15/p3372-pedreira.pdf > > On Sat, May 13, 2023 at 3:09 PM Andrew Lamb <al...@influxdata.com> wrote: > > > > I agree that it is hard to see any compelling advantage of adopting > > ListView that would incentivize adding it to DataFusion. > > > > It also seems like the conversion requires changing only indexes (not the > > underlying data) so it would likely be relatively inexpensive I would > think > > > > On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies > > <r.taylordav...@googlemail.com.invalid> wrote: > > > > > Hi All, > > > > > > > if we added this, do we think many Arrow and query > > > > engine implementations (for example, DataFusion) will be eager to add > > > full > > > > support for the type, including compute kernels? Or are they likely > to > > > just > > > > convert this type to ListArray at import boundaries? > > > I can't speak for query engines in general, but at least for arrow-rs > > > and by extension DataFusion, and based on my current understanding of > > > the use-cases I would be rather hesitant to add support to the kernels > > > for this array type, definitely instead favouring conversion at the > > > edges. We already have issues with the amount of code generation > > > resulting in binary bloat and long compile times, and I worry this > would > > > worsen this situation whilst not really providing compelling advantages > > > for the vast majority of workloads that don't interact with Velox. > > > Whilst I can definitely see that the ListView representation is > probably > > > a better way to represent variable length lists than what arrow settled > > > upon, I'm not yet convinced it is sufficiently better to incentivise > > > broad ecosystem adoption. > > > > > > Kind Regards, > > > > > > Raphael Taylor-Davies > > > > > > On 11/05/2023 21:20, Will Jones wrote: > > > > Hi Felipe, > > > > > > > > Thanks for the additional details. > > > > > > > > > > > >> Velox kernels benefit from being able to append data to the array > from > > > >> different threads without care for strict ordering. Only the offsets > > > array > > > >> has to be written according to logical order but that is > potentially a > > > much > > > >> smaller buffer than the values buffer. > > > >> > > > > It still seems to me like applications are still pretty niche, as I > > > suspect > > > > in most cases the benefits are outweighed by the costs. The benefit > here > > > > seems pretty limited: if you are trying to split work between > threads, > > > > usually you will have other levels such as array chunks to > parallelize. > > > And > > > > if you have an incoming stream of row data, you'll want to append in > > > > predictable order to match the order of the other arrays. Am I > missing > > > > something? > > > > > > > > And, IIUC, the cost of using ListView with out-of-order values over > > > > ListArray is you lose memory locality; the values of element 2 are no > > > > longer adjacent to the values of element 1. What do you think about > that > > > > tradeoff? > > > > > > > > I don't mean to be difficult about this. I'm excited for both the > REE and > > > > StringView arrays, but this one I'm not so sure about yet. I suppose > > > what I > > > > am trying to ask is, if we added this, do we think many Arrow and > query > > > > engine implementations (for example, DataFusion) will be eager to add > > > full > > > > support for the type, including compute kernels? Or are they likely > to > > > just > > > > convert this type to ListArray at import boundaries? > > > > > > > > Because if it turns out to be the latter, then we might as well ask > Velox > > > > to export this type as ListArray and save the rest of the ecosystem > some > > > > work. > > > > > > > > Best, > > > > > > > > Will Jones > > > > > > > > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho < > > > > felipe...@gmail.com> wrote: > > > > > > > >> Initial reason for ListView arrays in Arrow is zero-copy > compatibility > > > with > > > >> Velox which uses this format. > > > >> > > > >> Velox kernels benefit from being able to append data to the array > from > > > >> different threads without care for strict ordering. Only the offsets > > > array > > > >> has to be written according to logical order but that is > potentially a > > > much > > > >> smaller buffer than the values buffer. > > > >> > > > >> Acero kernels could take advantage of that in the future. > > > >> > > > >> In implementing ListViewArray/Type I was able to reuse some C++ > > > templates > > > >> used for ListArray which can reduce some of the burden on kernel > > > >> implementations that aim to work with all the types. > > > >> > > > >> I’m can fix Acero kernels for working with ListView. This is > similar to > > > the > > > >> work I’ve doing in kernels dealing with run-end encoded arrays. > > > >> > > > >> — > > > >> Felipe > > > >> > > > >> > > > >> On Wed, 26 Apr 2023 at 01:03 Will Jones <will.jones...@gmail.com> > > > wrote: > > > >> > > > >>> I suppose one common use case is materializing list columns after > some > > > >>> expanding operation like a join or unnest. That's a case where I > could > > > >>> imagine a lot of repetition of values. Haven't yet thought of > common > > > >> cases > > > >>> where there is overlap but not full duplication, but am eager to > hear > > > >> any. > > > >>> The dictionary encoding point Raphael makes is interesting, > especially > > > >>> given the existence of LargeList and FixedSizeList. For many > > > operations, > > > >> it > > > >>> might make more sense to just compose those existing types. > > > >>> > > > >>> IIUC the operations that would be unique to the ArrayView are ones > > > >> altering > > > >>> the shape. One could truncate each array to a certain length > cheaply > > > >> simply > > > >>> by replacing the sizes buffer. Or perhaps there are interesting > > > >> operations > > > >>> on tensors that would benefit. > > > >>> > > > >>> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies > > > >>> <r.taylordav...@googlemail.com.invalid> wrote: > > > >>> > > > >>>> Unless I am missing something, I think the selection use-case > could be > > > >>>> equally well served by a dictionary-encoded BinarArray/ListArray, > and > > > >>> would > > > >>>> have the benefit of not requiring any modifications to the > existing > > > >>> format > > > >>>> or kernels. > > > >>>> > > > >>>> The major additional flexibility of the proposed encoding would be > > > >>>> permitting disjoint or overlapping ranges, are these common > enough in > > > >>>> practice to represent a meaningful bottleneck? > > > >>>> > > > >>>> > > > >>>> On 26 April 2023 01:40:14 BST, David Li <lidav...@apache.org> > wrote: > > > >>>>> Is there a need for a 64-bit offsets version the same way we have > > > List > > > >>>> and LargeList? > > > >>>>> And just to be clear, the difference with List is that the lists > > > don't > > > >>>> have to be stored in their logical order (or in other words, > offsets > > > do > > > >>> not > > > >>>> have to be nondecreasing and so we also need sizes)? > > > >>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote: > > > >>>>>> For context, there was some discussion on this back in [1]. At > that > > > >>>> time > > > >>>>>> this was called "sequence view" but I do not like that name. > > > >> However, > > > >>>>>> array-view array is a little confusing. Given this is similar > to > > > >> list > > > >>>> can > > > >>>>>> we go with list-view array? > > > >>>>>> > > > >>>>>>> Thanks for the introduction. I'd be interested to hear about > the > > > >>>>>>> applications Velox has found for these vectors, and in what > > > >>> situations > > > >>>>>> they > > > >>>>>>> are useful. This could be contrasted with the current ListArray > > > >>>>>>> implementations. > > > >>>>>> I believe one significant benefit is that take (and by proxy, > > > >> filter) > > > >>>> and > > > >>>>>> sort are O(# of items) with the proposed format and O(# of > bytes) > > > >> with > > > >>>> the > > > >>>>>> current format. Jorge did some profiling to this effect in [1]. > > > >>>>>> > > > >>>>>> [1] > > > >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > >>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones < > will.jones...@gmail.com > > > >>>> wrote: > > > >>>>>>> Hi Felipe, > > > >>>>>>> > > > >>>>>>> Thanks for the introduction. I'd be interested to hear about > the > > > >>>>>>> applications Velox has found for these vectors, and in what > > > >>> situations > > > >>>> they > > > >>>>>>> are useful. This could be contrasted with the current ListArray > > > >>>>>>> implementations. > > > >>>>>>> > > > >>>>>>> IIUC it would be fairly cheap to transform a ListArray to an > > > >>>> ArrayView, but > > > >>>>>>> expensive to go the other way. > > > >>>>>>> > > > >>>>>>> Best, > > > >>>>>>> > > > >>>>>>> Will Jones > > > >>>>>>> > > > >>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho < > > > >>>>>>> felipe...@gmail.com> wrote: > > > >>>>>>> > > > >>>>>>>> Hi folks, > > > >>>>>>>> > > > >>>>>>>> I would like to start a public discussion on the inclusion of > a > > > >> new > > > >>>> array > > > >>>>>>>> format to Arrow — array-view array. The name is also up for > > > >> debate. > > > >>>>>>>> This format is inspired by Velox's ArrayVector format [1]. > > > >>> Logically, > > > >>>>>>> this > > > >>>>>>>> array represents an array of arrays. Each element is an > > > >> array-view > > > >>>>>>> (offset > > > >>>>>>>> and size pair) that points to a range within a nested "values" > > > >>> array > > > >>>>>>>> (called "elements" in Velox docs). The nested array can be of > any > > > >>>> type, > > > >>>>>>>> which makes this format very flexible and powerful. > > > >>>>>>>> > > > >>>>>>>> [image: ../_images/array-vector.png] > > > >>>>>>>> < > > > >>> https://facebookincubator.github.io/velox/_images/array-vector.png > > > > > >>>>>>>> I'm currently working on a C++ implementation and plan to work > > > >> on a > > > >>>> Go > > > >>>>>>>> implementation to fulfill the two-implementations requirement > for > > > >>>> format > > > >>>>>>>> changes. > > > >>>>>>>> > > > >>>>>>>> The draft design: > > > >>>>>>>> > > > >>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 > sizes > > > >>>> buffer] > > > >>>>>>>> - 1 child array: "values" as an array of the type parameter > > > >>>>>>>> > > > >>>>>>>> validity_bitmap is used to differentiate between empty array > > > >> views > > > >>>>>>>> (sizes[i] == 0) and NULL array views (validity_bitmap[i] == > 0). > > > >>>>>>>> > > > >>>>>>>> When the validity_bitmap[i] is 0, both sizes and offsets are > > > >>>> undefined > > > >>>>>>> (as > > > >>>>>>>> usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is > > > >>>> recommended > > > >>>>>>>> if setting a value is not an issue to the system producing the > > > >>>> arrays. > > > >>>>>>>> offsets buffer is not required to be ordered and views don't > have > > > >>> to > > > >>>> be > > > >>>>>>>> disjoint. > > > >>>>>>>> > > > >>>>>>>> [1] > > > >>>>>>>> > > > >> > > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > > >>>>>>>> Thanks, > > > >>>>>>>> Felipe O. Carvalho > > > >>>>>>>> > > > >