> I am -0.5 in adding it without removing the
> [Large]Utf8Array / Binary / List
I'm not sure about dropping List. Is SequenceView semantically equivalent to List / FixedSizeList? In other words, is SequenceView a nested type? The document seems to suggest it is, but the use case you described does not. For example, in the C++ compute today you cannot add List<INT32> + List<INT32>, but I think you would want to be able to add SequenceView<INT32> + SequenceView<INT32>. Also, the size of a List<INT32> is the # of lists and not the # of items. For a SequenceView I think the size of the array would be the number of items. I would also consider it a semantic change to go from Struct{"x": INT32} to Struct{"x": List<INT32>}.

From the use case it sounds more like SequenceView would be similar to dictionary and RLE: a different encoding for existing arrays. However, it is possible I am misreading things.

On Wed, Dec 15, 2021 at 10:49 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>
> Hi,
>
> Thanks a lot for this initiative and the write-up.
>
> I did a small bench for the sequence view and added a graph to the document as evidence of what Wes is writing wrt the performance of "selection / take / filter".
>
> Big +1 on replacing our current representation of variable-sized arrays with the "sequence view". Atm I am -0.5 on adding it without removing the [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently large to break compatibility and deprecate the previous representations (and do not enjoy maintaining multiple similar representations that solve very similar problems).
>
> Likewise, +1 for the RLE and -0.5 for the constant array, as the latter seems redundant to me (it is an RLE).
>
> Wrt the string view: I would like to run some benches on that too. Could someone clarify what the "good cases" are for that one?
>
> More generally, I second the point made by Antoine: there is already some fragmentation over the types in the official implementations (see [1]), and we do not even have a common integration test suite for the C data interface. One approach to this dimension is to *deprecate* representations, which goes in the direction mentioned above.
>
> Wrt the design, we could consider a separate enum for the RLE vs. plain encoding, as they are not really semantic types (the dictionary is also not a semantic type, but it is represented as one in at least the Rust implementation, unfortunately).
>
> Wrt the Rust impl in particular, I do not think that the string view poses a problem - Rust can lay out according to the C representation. Here [2] is the corresponding Rust code of the struct in the doc (generated via Rust's bindgen [3]).
>
> Thanks again for this, looking very much forward to it!
>
> [1] https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1546
> [2] https://github.com/DataEngineeringLabs/arrow-string-view/blob/main/src/string_view.rs
> [3] https://rust-lang.github.io/rust-bindgen/command-line-usage.html
>
>
> On Wed, Dec 15, 2021 at 3:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Ultimately, the problem comes down to providing a means of O(# records) selection (take, filter) performance and memory use for non-numeric data (strings, arrays, maps, etc.).
> >
> > DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types.
> > I am proposing to adopt these innovations as additional memory layouts in Arrow with a target of zero-copy across the C ABI — how exactly they are translated to the IPC format seems less of an immediate benefit than enabling the in-memory performance/memory use optimization, since query engines can accelerate performance with faster selections. If there are some alternative proposals to achieve O(# records) time and space complexity for selection operations, let's definitely look at them.
> >
> >
> > On Tue, Dec 14, 2021 at 8:02 PM Weston Pace <weston.p...@gmail.com> wrote:
> > >
> > > Would it be simpler to change the spec so that child arrays can be chunked? This might reduce the data type growth and make the intent more clear.
> > >
> > > This will add another dimension to performance analysis. We pretty regularly get issues/tickets from users that have unknowingly created Parquet files with poor row group resolution (e.g. 50 rows per row group) and experience rotten performance as a result. I suspect something similar could happen here. It sounds like arrays will naturally subdivide over time. Users might start seeing poor performance without realizing the root cause is that their 1 million element array has been split into 10,000 allocations of 100 elements. However, I suspect this is something that could be managed with visibility and recompaction utilities.
> > >
> > >
> > > On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > hi folks,
> > > >
> > > > A few things in the general discussion, before certain things will have to be split off into their own dedicated discussions.
> > > >
> > > > It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory layouts are solving.
> > > >
> > > > In Arrow currently, selection operations ("take", "filter", or indirect sort — the equivalent of arr.take(argsort(something_else)) if you're coming from NumPy) have time complexity proportional to the number of records for primitive types and complexity proportional to max(# records, memory size) for nested types.
> > > >
> > > > So, for example:
> > > >
> > > > * Take(arr, indices) has O(# records) complexity for primitive types and does O(# records) memory allocation
> > > > * Take(arr, indices) has O(max(# records, size of memory buffers / child arrays)) complexity for strings and nested types and does O(size of memory buffers) memory allocation
> > > >
> > > > This means that columnar query engines that leverage selections can experience heavy costs both in time complexity and memory use when doing selections on non-primitive array data. Selections may arise from filtering or sorting or other operations.
> > > >
> > > > The "string view" and "sequence view" memory layouts in this document do not have this problem. When using these for strings and nested data, they have the same time complexity and memory allocation behavior for selections as primitive types, and the "child" memory buffers do not have to be manipulated or rebuilt at all. This has significant performance benefits and reduced memory use.
> > > >
> > > > Additionally, the string view and sequence view layouts solve the problem of out-of-order construction. As has been pointed out, one way to work around this issue at present is to use "chunked arrays".
> > > > However, this means that you cannot ever use thread parallelism in the construction of non-chunked outputs with nested data (for example, in expression evaluation) — if a nested array forms part of a record batch, then either you must stick to single-threaded execution or use thread parallelism to subdivide even the other fields of the record batch that are non-nested to obtain equal-sized arrays across all fields. For example, if you had a record batch with 32K rows and wanted to parallelize execution of a projection using 4 threads — you would need to divide all fields into chunks of 8K each prior to beginning to produce outputs. This is fairly inflexible.
> > > >
> > > > As another motivating example, consider a parallel selection operation (e.g. "take" or "filter") on a nested array. Currently it is not possible to parallelize at all because of the in-order construction requirement.
> > > >
> > > > I don't expect you to just trust me — here is an example:
> > > >
> > > > https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646
> > > >
> > > > In this example, I use Take to permute 1M doubles and 1M strings with 50 bytes each:
> > > >
> > > > * Doubles: 2.45ms (new memory allocated: 8000000)
> > > > * Strings: 39.6ms (new memory allocated: 54000000)
> > > >
> > > > The performance ratio is 16x even though the memory ratio is only ~7x. With the "StringView" data type, only 16000000 bytes of new memory would need to be allocated, and the performance should be only 2-4x slower than the doubles case (because we only need to relocate a bunch of 16-byte structs) instead of 16x slower.
> > > >
> > > > I hope you can see now that this can be a rather serious resource utilization issue, both in processing time and memory use.
> > > > I will update the document to explain this better and work on responding to some of the other comments.
> > > >
> > > > Wes
> > > >
> > > > On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then, only on the read side.
> > > > >
> > > > > There is an additional subtlety in that Arrow is not a storage mechanism but represents data in memory, so pieces doing computation have to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
> > > > >
> > > > > As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
> > > > >
> > > > >
> > > > > On the detail of the proposed encodings:
> > > > >
> > > > > - I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but it will also make data transfer more complicated
> > > > >
> > > > > - Constant arrays are a special case of RLE arrays and I'm not sure doing both is really useful
> > > > >
> > > > > - I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU won't automatically prefetch data for you
> > > > >
> > > > > - The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of RLE sequences, such that they become beneficial even for very short repeated runs
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On 10/12/2021 at 20:28, Wes McKinney wrote:
> > > > > >
> > > > > > This topic may provoke debate, but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. In recent times, I have been motivated by interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they've made around data representation, providing beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings unequal in the first 4 bytes, and inline small strings.
> > > > > > Both the Rust and C++ query engine efforts could potentially benefit from this (not sure about the memory safety implications in Rust; comments around this would be helpful).
> > > > > >
> > > > > > I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:
> > > > > >
> > > > > > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > > > > >
> > > > > > Each of these potential additions would eventually need to be split off into independent efforts with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on.
> > > > > >
> > > > > > The document is open to anyone to comment, but if anyone would like edit access please feel free to request it, and I look forward to the discussion.
> > > > > >
> > > > > > Thanks,
> > > > > > Wes