> I am -0.5 in adding it without removing the
> [Large]Utf8Array / Binary / List
I'm not sure about dropping List. Is SequenceView semantically equivalent to List / FixedSizeList? In other words, is SequenceView a nested type? The document seems to suggest it is, but the use case you described does not. For example, in the C++ compute today you cannot add List<INT32> + List<INT32>, but I think you would want to be able to add SequenceView<INT32> + SequenceView<INT32>. Also, the size of a List<INT32> is the # of lists and not the # of items. For a SequenceView I think the size of the array would be the number of items. I would also consider it a semantic change to go from Struct{"x": INT32} to Struct{"x": List<INT32>}.

From the use case it sounds more like SequenceView would be similar to dictionary and RLE: a different encoding for existing arrays. However, it is possible I am misreading things.

On Wed, Dec 15, 2021 at 10:49 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>
> Hi,
>
> Thanks a lot for this initiative and the write-up.
>
> I did a small bench for the sequence view and added a graph to the document as evidence of what Wes is writing wrt the performance of "selection / take / filter".
>
> Big +1 on replacing our current representation of variable-sized arrays with the "sequence view". Atm I am -0.5 on adding it without removing the [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently large to break compatibility and deprecate the previous representations (and do not enjoy maintaining multiple similar representations that solve very similar problems).
>
> Likewise, +1 for the RLE and -0.5 for the constant array, as the latter seems redundant to me (it is an RLE).
>
> Wrt the string view: I would like to run some benches on that too. Could someone clarify what the "good cases" are for that one?
>
> More generally, I second the point made by Antoine: there is already some fragmentation over the types in the official implementations (see [1]), and we do not even have a common integration test suite for the C data interface. One approach to this dimension is to *deprecate* representations, which goes in the direction mentioned above.
>
> Wrt the design, we could consider a separate enum for the RLE vs. plain encoding, as they are not really semantic types (the dictionary is also not a semantic type, but it is represented as one in at least the Rust implementation, unfortunately).
>
> Wrt the Rust impl in particular, I do not think that the string view poses a problem - Rust can lay out according to the C representation. Here [2] is the corresponding Rust code of the struct in the doc (generated via Rust's bindgen [3]).
>
> Thanks again for this, looking very much forward to it!
>
> [1] https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1546
> [2] https://github.com/DataEngineeringLabs/arrow-string-view/blob/main/src/string_view.rs
> [3] https://rust-lang.github.io/rust-bindgen/command-line-usage.html
>
>
> On Wed, Dec 15, 2021 at 3:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Ultimately, the problem comes down to providing a means of O(# records) selection (take, filter) performance and memory use for non-numeric data (strings, arrays, maps, etc.).
> >
> > DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types.
> > I am proposing to adopt these innovations as additional memory layouts in Arrow with a target of zero-copy across the C ABI — how exactly they are translated to the IPC format seems less of an immediate benefit than enabling the in-memory performance/memory use optimization, since query engines can accelerate performance with faster selections. If there are some alternative proposals to achieve O(# records) time and space complexity for selection operations, let's definitely look at them.
> >
> >
> > On Tue, Dec 14, 2021 at 8:02 PM Weston Pace <weston.p...@gmail.com> wrote:
> > >
> > > Would it be simpler to change the spec so that child arrays can be chunked? This might reduce the data type growth and make the intent more clear.
> > >
> > > This will add another dimension to performance analysis. We pretty regularly get issues/tickets from users that have unknowingly created Parquet files with poor row group resolution (e.g. 50 rows per row group) and experience rotten performance as a result. I suspect something similar could happen here. It sounds like arrays will naturally subdivide over time. Users might start seeing poor performance without realizing the root cause is that their 1 million element array has been split into 10,000 allocations of 100 elements. However, I suspect this is something that could be managed with visibility and recompaction utilities.
> > >
> > >
> > > On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > hi folks,
> > > >
> > > > A few things in the general discussion, before certain things will have to be split off into their own dedicated discussions.
> > > >
> > > > It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory layouts are solving.
> > > >
> > > > In Arrow currently, selection operations ("take", "filter", or indirect sort — the equivalent of arr.take(argsort(something_else)) if you're coming from NumPy) have time complexity proportional to the number of records for primitive types and complexity proportional to max(# records, memory size) for nested types.
> > > >
> > > > So, for example:
> > > >
> > > > * Take(arr, indices) has O(# records) complexity for primitive types and does O(# records) memory allocation
> > > > * Take(arr, indices) has O(max(# records, size of memory buffers / child arrays)) complexity for strings and nested types and does O(size of memory buffers) memory allocation
> > > >
> > > > This means that columnar query engines that leverage selections can experience heavy costs both in time complexity and memory use when doing selections on non-primitive array data. Selections may arise from filtering or sorting or other operations.
> > > >
> > > > The "string view" and "sequence view" memory layouts in this document do not have this problem. When using these for strings and nested data, they have the same time complexity and memory allocation behavior for selections as primitive types, and the "child" memory buffers do not have to be manipulated or rebuilt at all. This has significant performance benefits and reduced memory use.
> > > >
> > > > Additionally, the string view and sequence view layouts solve the problem of out-of-order construction. As has been pointed out, one way to work around this issue at present is to use "chunked arrays".
> > > > However, this means that you cannot ever use thread parallelism in the construction of non-chunked outputs with nested data (for example, in expression evaluation) — if a nested array forms part of a record batch, then either you must stick to single-threaded execution or use thread parallelism to subdivide even the other fields of the record batch that are non-nested to obtain equal-sized arrays across all fields. For example, if you had a record batch with 32K rows and wanted to parallelize execution of a projection using 4 threads — you would need to divide all fields into chunks of 8K each prior to beginning to produce outputs. This is fairly inflexible.
> > > >
> > > > As another motivating example, consider a parallel selection operation (e.g. "take" or "filter") on a nested array. Currently it is not possible to parallelize at all because of the in-order construction requirement.
> > > >
> > > > I don't expect you to just trust me — here is an example:
> > > >
> > > > https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646
> > > >
> > > > In this example, I use Take to permute 1M doubles and 1M strings with 50 bytes each:
> > > >
> > > > * Doubles: 2.45ms (new memory allocated: 8000000)
> > > > * Strings: 39.6ms (new memory allocated: 54000000)
> > > >
> > > > The performance ratio is 16x even though the memory ratio is only ~7x. With the "StringView" data type, only 16000000 bytes of new memory would need to be allocated, and the performance should be only 2-4x slower than the doubles case (because we only need to relocate a bunch of 16-byte structs) instead of 16x slower.
> > > >
> > > > I hope you can see now that this can be a rather serious resource utilization issue, both in processing time and memory use.
> > > > I will update the document to explain this better and work on responding to some of the other comments.
> > > >
> > > > Wes
> > > >
> > > > On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then, only on the read side.
> > > > >
> > > > > There is an additional subtlety in that Arrow is not a storage mechanism but represents data in memory, so pieces doing computation have to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
> > > > >
> > > > > As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
> > > > >
> > > > >
> > > > > On the detail of the proposed encodings:
> > > > >
> > > > > - I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but it will also make data transfer more complicated
> > > > >
> > > > > - Constant arrays are a special case of RLE arrays and I'm not sure doing both is really useful
> > > > >
> > > > > - I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU won't automatically prefetch data for you
> > > > >
> > > > > - The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of RLE sequences, such that they become beneficial even for very short repeated runs
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On 10/12/2021 at 20:28, Wes McKinney wrote:
> > > > > >
> > > > > > This topic may provoke debate, but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. In recent times, I have been motivated by interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they've made around data representation, providing beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings unequal in the first 4 bytes, and inline small strings.
> > > > > > Both the Rust and C++ query engine efforts could potentially benefit from this (not sure about the memory safety implications in Rust; comments around this would be helpful).
> > > > > >
> > > > > > I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:
> > > > > >
> > > > > > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > > > > >
> > > > > > Each of these potential additions would eventually need to be split off into independent efforts with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on.
> > > > > >
> > > > > > The document is open to anyone to comment, but if anyone would like edit access please feel free to request it, and I look forward to the discussion.
> > > > > >
> > > > > > Thanks,
> > > > > > Wes