Re: [DISCUSS][Format] Draft implementation of string view array format

Benjamin Kietzman Thu, 15 Jun 2023 09:03:50 -0700

Hello again all,

The PR [1] to add string view to the format and the C++ implementation is
hovering around passing CI and has been undrafted. Furthermore, there is
now also a PR [2] to add string view to the Go implementation. Code review
is underway for each PR and I'd like to move toward a vote for acceptance-
are there any other preliminaries which I've neglected?


To reiterate the answers to some past questions:
- Benchmarks are added in the C++ PR [1] to demonstrate the performance of
  conversion between the various string formats. In addition, there are
  some benchmarks which demonstrate the performance gains available with
  the new format [3].
- Adding string view to the C ABI is a natural follow up, but should be
  handled independently. An issue has been added to track that
  enhancement [4].

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/35628
[2] https://github.com/apache/arrow/pull/35769
[3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
[4] https://github.com/apache/arrow/issues/36099

On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <[email protected]>
wrote:

> @Jacob
> > You mention benchmarks multiple times, are these results published
> somewhere?
>
> I benchmarked the performance of raw pointer vs index offset views in my
> PR to velox,
> I do intend to port them to my arrow PR but I haven't gotten there yet.
> Furthermore, it
> seemed less urgent to me since coexistence of the two types in the c++
> implementation
> defers the question of how aggressively one should be preferred over the
> other.
>
> @Dewey
> > I don't see the C Data interface in the PR
>
> I have not addressed the C ABI in this PR. As you mention, it may be
> useful to transmit
> arrays with raw pointer views between implementations which allow them. I
> can address
> this in a follow up PR.
>
> @Will
> > If I understand correctly, multiple arrays can reference the same buffers
> > in memory, but once they are written to IPC their data buffers will be
> > duplicated. Is that right?
> The buffers will be duplicated. If buffer duplication is becomes a
> concern, I'd prefer to handle
> that in the ipc writer. Then buffers which are duplicated could be
> detected by checking
> pointer identity and written only once.
>
>
> On Wed, May 17, 2023 at 12:07 AM Will Jones <[email protected]>
> wrote:
>
>> Hello Ben,
>>
>> Thanks for your work on this. I think this will be an excellent addition
>> to
>> the format.
>>
>> If I understand correctly, multiple arrays can reference the same buffers
>> in memory, but once they are written to IPC their data buffers will be
>> duplicated. Is that right?
>>
>> Dictionary types have a special message so they can be reused across
>> batches and even fields. Did we consider adding a similar message for
>> string view buffers?
>>
>> One relevant use case I'm curious about is substring extraction. For
>> example, if I have a column of URIs and I create columns where I've
>> extracted substrings like the hostname, path, and a list column of query
>> parameters, I'd like for those latter columns to be views into the URI
>> buffers, rather than full copies.
>>
>> However, I've never touched the IPC read code paths, so it's quite
>> possible
>> I'm overlooking something obvious.
>>
>> Best,
>>
>> Will Jones
>>
>>
>> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
>> <[email protected]> wrote:
>>
>> > Very cool!
>> >
>> > In addition to performance mentioned above, I could see this being
>> > useful for the R bindings - we already have a global string pool and a
>> > mechanism for keeping a vector of them alive.
>> >
>> > I don't see the C Data interface in the PR although I may have missed
>> > it - is that a part of the proposal? It seems like it would be
>> > possible to use raw pointers as long as they can be guaranteed to be
>> > valid until the release callback is called?
>> >
>> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
>> > <[email protected]> wrote:
>> > >
>> > > Hello Everyone,
>> > > I think keeping interoperability with the large ecosystem is a very
>> > > important goal for arrow so I am overall in favor of this proposal!
>> > >
>> > > You mention benchmarks multiple times, are these results published
>> > > somewhere?
>> > >
>> > > Thanks
>> > >
>> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <
>> [email protected]>
>> > > wrote:
>> > >
>> > > > Hello all,
>> > > >
>> > > > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox
>> > > > compatible
>> > > > "string view" type could bring several performance benefits to
>> access
>> > and
>> > > > authoring of string data in the arrow format [2]. Additionally
>> better
>> > > > interoperability with engines already using this format could be
>> > > > established.
>> > > >
>> > > > PR #0 [3] adds Utf8View and BinaryView types to the C++
>> implementation
>> > and
>> > > > to
>> > > > the IPC format. For the purposes of IPC raw pointers are not used.
>> > Instead,
>> > > > each view contains a pair of 32 bit unsigned integers which encode
>> the
>> > > > index of
>> > > > a character buffer (string view arrays may consist of a variable
>> > number of
>> > > > such buffers) and the offset of a view's data within that buffer
>> > > > respectively.
>> > > > Benefits of this substitution include:
>> > > > - This makes explicit the guarantee that lifetime of all character
>> > data is
>> > > > equal
>> > > >   to that of the array which views it, which is critical for
>> confident
>> > > >   consumption across an interface boundary.
>> > > > - As with other types in the arrow format, such arrays are
>> > serializable and
>> > > >   venue agnostic; directly usable in shared memory without
>> > modification.
>> > > > - Indices and offsets are easily validated.
>> > > >
>> > > > Accessing the data requires some trivial pointer arithmetic, but in
>> > > > benchmarking
>> > > > this had negligible impact on sequential access and only minor
>> impact
>> > on
>> > > > random
>> > > > access.
>> > > >
>> > > > In the C++ implementation, raw pointer string views are supported
>> as an
>> > > > extended
>> > > > case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`.
>> > > > Branching on
>> > > > this access pattern bit at the data type level has negligible
>> impact on
>> > > > performance since the branch resides outside any hot loops. Utility
>> > > > functions
>> > > > are provided for efficient (potentially in-place) conversion between
>> > raw
>> > > > pointer
>> > > > and index offset views. For example, the C++ implementation could
>> zero
>> > copy
>> > > > a raw pointer array from Velox, filter it, then convert to
>> > index/offset for
>> > > > serialization. Other implementations may choose to accommodate or
>> > eschew
>> > > > raw
>> > > > pointer views as their communities direct.
>> > > >
>> > > > Where desirous in a rigorously controlled context this still enables
>> > > > construction
>> > > > and safe consumption of string view arrays which reference memory
>> not
>> > > > directly bound to the lifetime of the array. I'm not sure when or
>> if we
>> > > > would
>> > > > find it useful to have arrays like this; I do not introduce any in
>> > [3]. I
>> > > > mention
>> > > > this possibility to highlight that if benchmarking demonstrates that
>> > such
>> > > > an
>> > > > approach brings a significant performance benefit to some operation,
>> > the
>> > > > only
>> > > > barrier to its adoption would be code review. Likewise if more
>> > intensive
>> > > > benchmarking determines that raw pointer views critically outperform
>> > > > index/offset
>> > > > views for real-world analytics tasks, prioritizing raw pointer
>> string
>> > views
>> > > > for usage within the C++ implementation will be straightforward.
>> > > >
>> > > > See also the proposal to Velox that their string view vector be
>> > refactored
>> > > > in a similar vein [4].
>> > > >
>> > > > Sincerely,
>> > > > Ben Kietzman
>> > > >
>> > > > [1]
>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>> > > > [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
>> > > > [3] https://github.com/apache/arrow/pull/35628
>> > > > [4] https://github.com/facebookincubator/velox/discussions/4362
>> > > >
>> >
>>
>

Re: [DISCUSS][Format] Draft implementation of string view array format

Reply via email to