Re: [DISCUSS][Format] Draft implementation of string view array format

Benjamin Kietzman Thu, 06 Jul 2023 09:47:29 -0700

@Andrew:

Restricting these arrays to a single buffer will severely decrease their
utility. Since the character data is stored in multiple character buffers
writing Utf8View array can proceed without resizing allocations,
which is a major overhead when writing Utf8 arrays. Furthermore since the
character buffers have no restrictions on their size, it's straightforward
to
reuse an existing buffer as a character buffer rather than always allocating
a new one. In the case of creating an array which shares a lot of data with
another (for example, appending some strings) we can reuse most of the
character buffers from the original. Finally Utf8View is well adapted for
efficiently wrapping non-arrow string data for ingestion by a kernel, even
if the string data's full extent is not known ahead of time and is spread
across multiple non-contiguous buffers.


@Raphael:

> branch on access

The branch-on-access is unavoidable since a primary feature of the Utf8View
format is keeping short strings inline in the fixed width portion of data.
It's worth noting that the inline prefix allows skipping the branch entirely
for common cases of comparison, for example when the strings to be compared
differ within the first 4 bytes.

In benchmarking (for example while building a hash table) I have not
observed
that this branch overly pessimizes access. Although I can't guarantee every
Utf8View array will be more efficient than any Utf8 array, it is certainly
faster for many relevant cases. Specifically sorting and equality comparison
benefit significantly from the prefix comparison fast path,
so I'd anticipate that multi column sorting and aggregations would as well.
If there are any other benchmarks which would help to justify Utf8View in
your
mind, I'd be happy to try writing them.

> UTF-8 validation for StringArray can be done very efficiently by first
verifying the entire buffer, and then verifying the offsets correspond to
the start of a UTF-8 codepoint

For non-inlined strings, the character buffers do always contain the entire
string's data and not just the last `len - 4` bytes. Thus the approach you
describe for validating an entire character buffer as UTF-8 then checking
offsets will be just as valid for Utf8View arrays as for Utf8 arrays.

> it does seem inconsistent to use unsigned types

It is indeed more typical for the arrow format to use signed integers for
offsets and other quantities. In this case there is prior art in other
engines with which we can remain compatible by using unsigned integers
instead. Since this is only a break with convention within the format and
shouldn't be difficult for any implementation to accommodate, I would argue
that it's worthwhile to avoid pushing change onto existing implementers.

> I presume that StringView will behave similarly to dictionaries in that
the selection kernels will not recompute the underlying value buffers.

The Utf8View format itself is not prescriptive of selection operations on
the
array; kernels are free to reuse character buffers (which produces an
implicit
selection vector) or to recompute them. Furthermore unlike an explicit
selection vector a kernel may decide to copy and densify dynamically if it
detects that output is getting sparse or fragmented. It's also worth noting
that unlike an explicit selection vector a Utf8View array (however sparse or
fragmented) will still benefit from the prefix comparison fast path.

Sincerely,
Ben Kietzman

On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> > I would be interested in hearing some input from the Rust community.
>
>  A couple of thoughts:
>
> The variable number of buffers would definitely pose some challenges for
> the Rust implementation, the closest thing we currently have is possibly
> UnionArray, but even then the number of buffers is still determined
> statically by the DataType. I therefore also wonder about the possibility
> of always having a single backing buffer that stores the character data,
> including potentially a copy of the prefix. This would also avoid forcing a
> branch on access, which I would have expected to hurt performance for some
> kernels quite significantly.
>
> Whilst not really a concern for Rust, which supports unsigned types, it
> does seem inconsistent to use unsigned types where the rest of the format
> encourages the use of signed offsets, etc...
>
> It isn't clearly specified whether a null should have a valid set of
> offsets, etc... I think it is an important property of the current array
> layouts that, with exception to dictionaries, the data in null slots is
> arbitrary, i.e. can take any value, but not undefined. This allows for
> separate handling of the null mask and values, which can be important for
> some kernels and APIs.
>
> More an observation than an issue, but UTF-8 validation for StringArray
> can be done very efficiently by first verifying the entire buffer, and then
> verifying the offsets correspond to the start of a UTF-8 codepoint. This
> same approach would not be possible for StringView, which would need to
> verify individual values and would therefore be significantly more
> expensive. As it is UB for a Rust string to contain non-UTF-8 data, this
> validation is perhaps more important for Rust than for other languages.
>
> I presume that StringView will behave similarly to dictionaries in that
> the selection kernels will not recompute the underlying value buffers. I
> think this is fine, but it is perhaps worth noting this has caused
> confusion in the past, as people somewhat reasonably expect an array
> post-selection to have memory usage reflecting the smaller selection. This
> is then especially noticeable if the data is written out to IPC, and still
> contains data that was supposedly filtered out. My 2 cents is that explicit
> selection vectors are a less surprising way to defer selection than baking
> it into the array, but I also don't have any workloads where this is the
> major bottleneck so can't speak authoritatively here.
>
> Which leads on to my major concern with this proposal, that it adds
> complexity and cognitive load to the specification and implementations,
> whilst not meaningfully improving the performance of the operators that I
> commonly encounter as performance bottlenecks, which are multi-column sorts
> and aggregations, or the expensive string operations such as matching or
> parsing. If we didn't already have a string representation I would be more
> onboard, but as it stands I'm definitely on the fence, especially given
> selection performance can be improved in less intrusive ways using
> dictionaries or selection vectors.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> On 02/07/2023 11:46, Andrew Lamb wrote:
>
>  * This is the first layout where the number of buffers depends on the
>
> data
>
> and not the schema. I think this is the most architecturally significant
> fact. I
>
>  I have spent some time reading the initial proposal -- thank you for
> that. I now understand what Weston was saying about the "variable numbers
> of buffers". I wonder if you considered restricting such arrays to a single
> buffer (so as to make them more similar to other arrow array types that
> have a fixed number of buffers)? On Tue, Jun 20, 2023 at 11:33 AM Weston
> Pace <weston.p...@gmail.com> <mailto:weston.p...@gmail.com> wrote:
>
> Before I say anything else I'll say that I am in favor of this new layout.
> There is some existing literature on the idea (e.g. umbra) and your
> benchmarks show some nice improvements. Compared to some of the other
> layouts we've discussed recently (REE, list veiw) I do think this layout is
> more unique and fundamentally different. Perhaps most fundamentally
> different: * This is the first layout where the number of buffers depends
> on the data and not the schema. I think this is the most architecturally
> significant fact. It does require a (backwards compatible) change to the
> IPC format itself, beyond just adding new type codes. It also poses
> challenges in places where we've assumed there will be at most 3 buffers
> (e.g. in ArraySpan, though, as you have shown, we can work around this
> using a raw pointers representation internally in those spots). I think
> you've done some great work to integrate this well with Arrow-C++ and I'm
> convinced it can work. I would be interested in hearing some input from the
> Rust community. Ben, at one point there was some discussion that this might
> be a c-data only type. However, I believe that was based on the raw
> pointers representation. What you've proposed here, if I understand
> correctly, is an index + offsets representation and it is suitable for IPC
> correct? (e.g. I see that you have changes and examples in the IPC
> reader/writer) On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <
> bengil...@gmail.com> <mailto:bengil...@gmail.com> wrote:
>
> Hi Gang, I'm not sure what you mean, sorry if my answers are off base:
> Parquet's ByteArray will be unaffected by the addition of the string view
> type; all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING,
> and with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
> during serialization to parquet [1]. If you mean that encoding of
> arrow::Type::STRING_VIEW will not be as fast as encoding of equivalent
> arrow::Type::STRING, that's something I haven't benchmarked so I can't
> answer definitively. I would expect it to be
>
> faster
>
> than first converting STRING_VIEW->STRING then encoding to parquet; direct
> encoding avoids allocating and populating temporary buffers. Of course
>
> this
>
> only applies to cases where you need to encode an array of STRING_VIEW to
> parquet- encoding of STRING to parquet will be unaffected. Sincerely, Ben
> [1]
>
>
> https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
>
>  On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> <mailto:
> ust...@gmail.com> wrote:
>
> Hi Ben, The posted benchmark [1] looks pretty good to me. However, I want
> to raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
> uses a customized parquet::ByteArray type [2] for string/binary, I
>
> would
>
> expect some regression of conversions between parquet reader/writer and
> the proposed string view array, especially when some strings use short form
> and others use long form. [1]
>
>
> https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
>
> [2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> Best, Gang On Fri, Jun 16, 2023 at 3:58 AM Will Jones <
> will.jones...@gmail.com> <mailto:will.jones...@gmail.com> wrote:
>
> Cool. Thanks for doing that! On Thu, Jun 15, 2023 at 12:40 Benjamin
> Kietzman <bengil...@gmail.com <mailto:bengil...@gmail.com>
>
> wrote:
>
> I've added https://github.com/apache/arrow/issues/36112 to track
> deduplication of buffers on write. I don't think it would require
> modification of the IPC format. Ben On Thu, Jun 15, 2023 at 1:30 PM Matt
> Topol <zotthewiz...@gmail.com <mailto:zotthewiz...@gmail.com>
>
> wrote:
>
> Based on my understanding, in theory a buffer *could* be shared
>
> within
>
> a
>
> batch since the flatbuffers message just uses an offset and
>
> length
>
> to
>
> identify the buffers. That said, I don't believe any current
> implementation actually
>
> does
>
> this
>
> or
>
> takes advantage of this in any meaningful way. --Matt On Thu, Jun 15, 2023
> at 1:00 PM Will Jones <
>
> will.jones...@gmail.com <mailto:will.jones...@gmail.com>>
>
> wrote:
>
> Hi Ben, It's exciting to see this move along. The buffers will be
> duplicated. If buffer duplication is
>
> becomes
>
> a
>
> concern,
>
> I'd prefer to handle that in the ipc writer. Then buffers which are
> duplicated
>
> could
>
> be
>
> detected
>
> by checking pointer identity and written only once.
>
>  Question: to be able to write buffer only once and reference in
>
> multiple
>
> arrays, does that require a change to the IPC format? Or is
>
> sharing
>
> buffers
>
> within the same batch already allowed in the IPC format? Best, Will Jones
> On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <
>
> bengil...@gmail.com <mailto:bengil...@gmail.com>
>
> wrote:
>
> Hello again all, The PR [1] to add string view to the format and the C++
>
> implementation
>
> is
>
> hovering around passing CI and has been undrafted.
>
> Furthermore,
>
> there
>
> is
>
> now also a PR [2] to add string view to the Go
>
> implementation.
>
> Code
>
> review
>
> is underway for each PR and I'd like to move toward a vote
>
> for
>
> acceptance-
>
> are there any other preliminaries which I've neglected? To reiterate the
> answers to some past questions: - Benchmarks are added in the C++ PR [1] to
> demonstrate the
>
> performance
>
> of
>
>  conversion between the various string formats. In addition,
>
> there
>
> are
>
>  some benchmarks which demonstrate the performance gains
>
> available
>
> with
>
>  the new format [3]. - Adding string view to the C ABI is a natural follow
> up, but
>
> should
>
> be
>
>  handled independently. An issue has been added to track
>
> that
>
>  enhancement [4]. Sincerely, Ben Kietzman [1]
> https://github.com/apache/arrow/pull/35628 [2]
> https://github.com/apache/arrow/pull/35769 [3]
>
> https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
>
> [4] https://github.com/apache/arrow/issues/36099 On Wed, May 17, 2023 at
> 12:53 PM Benjamin Kietzman <
>
> bengil...@gmail.com <mailto:bengil...@gmail.com>>
>
> wrote:
>
> @Jacob
>
> You mention benchmarks multiple times, are these results
>
> published
>
> somewhere? I benchmarked the performance of raw pointer vs index
>
> offset
>
> views
>
> in
>
> my
>
> PR to velox, I do intend to port them to my arrow PR but I haven't
>
> gotten
>
> there
>
> yet.
>
> Furthermore, it seemed less urgent to me since coexistence of the two
> types
>
> in
>
> the
>
> c++
>
> implementation defers the question of how aggressively one should be
>
> preferred
>
> over
>
> the
>
> other. @Dewey
>
> I don't see the C Data interface in the PR
>
>  I have not addressed the C ABI in this PR. As you mention,
>
> it
>
> may
>
> be
>
> useful to transmit arrays with raw pointer views between implementations
> which
>
> allow
>
> them. I
>
> can address this in a follow up PR. @Will
>
> If I understand correctly, multiple arrays can reference
>
> the
>
> same
>
> buffers
>
> in memory, but once they are written to IPC their data

Re: [DISCUSS][Format] Draft implementation of string view array format

Reply via email to