For example, if someone (datafusion, velox, etc.) were to come up with a
framework for UDFs then would batches be passed in and out of those UDFs in
the Arrow format?
Yes, I think the arrow format is a perfect fit for this
Is Arrow meant to only be used in between systems (in this case query
engines) or is it also meant to be used in between components of a query
engine?

I can't speak for all query engines, but at least in the case of DataFusion we exclusively use the Arrow format as the interchange format between operators, including for UDFs. We have found that for most operators, operating directly on the Arrow format is sufficiently performant not to represent a query bottleneck. For others, such as joins, sorts and aggregates, we do make use of bespoke data structures and formats internally, e.g. hash tables and row formats, but the operators' public APIs are still in terms of Arrow RecordBatch. We have found this approach to perform very well, whilst also providing very good modularity and composability.
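
To make the UDF boundary concrete, here is a minimal sketch in Rust of what
an Arrow-native UDF body looks like: it receives Arrow arrays and returns an
Arrow array, so the same function can sit behind DataFusion's scalar UDF
wrappers or behind any other engine that can hand it a RecordBatch column.
The function name and the choice of operation are illustrative only, not an
existing API.

    use std::sync::Arc;
    use arrow::array::{Array, ArrayRef, Int32Array, StringArray};
    use arrow::error::ArrowError;

    /// Hypothetical UDF: byte length of each string value, preserving nulls.
    /// The signature (Arrow arrays in, Arrow array out) is the Arrow-native
    /// boundary described above; engine-specific wrapper types are omitted.
    fn string_byte_length(args: &[ArrayRef]) -> Result<ArrayRef, ArrowError> {
        let input = args[0]
            .as_any()
            .downcast_ref::<StringArray>()
            .ok_or_else(|| ArrowError::InvalidArgumentError("expected Utf8".into()))?;

        // Iterate the Arrow array directly; null inputs map to null outputs.
        let lengths: Int32Array = input
            .iter()
            .map(|v| v.map(|s| s.len() as i32))
            .collect();

        Ok(Arc::new(lengths))
    }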

In fact we are currently in the process of migrating the aggregation logic away from a bespoke mutable row representation to the Arrow model, and we are already seeing significant performance improvements, not to mention a significant reduction in code complexity and improved composability [1].

If every engine has its own bespoke formats internally
then it seems we are placing a limit on how far things can be decomposed.
Agreed, if engines choose to implement operations on bespoke formats, those operations will likely not be as interoperable as those implemented using Arrow. To what extent an engine favours its own format(s) over Arrow is an engineering trade-off each engine will have to make, but DataFusion has found exclusively using Arrow as the interchange format between operators to work well.

There are now multiple implementations of a query
engine and I think we are seeing just the edges of this query engine
decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or consuming
a velox task as a record batch stream into a different system) and these
sorts of challenges are in the forefront.
I agree 100% that this sort of interoperability is what makes Arrow so compelling, and it is something we should work very hard to preserve. This is the crux of my concern with standardising alternative layouts. I definitely hope that with time Arrow will penetrate deeper into these engines, perhaps in a similar manner to DataFusion, as opposed to primarily existing at the surface level.

[1]: https://github.com/apache/arrow-datafusion/pull/6800

On 10/07/2023 11:38, Weston Pace wrote:
The point I was trying to make, albeit very badly, was that these
operations are typically implemented using some sort of row format [1]
[2], and therefore their performance is not impacted by the array
representations. I think it is both inevitable, and in fact something to
be encouraged, that query engines will implement their own in-memory
layouts and data structures outside of the arrow specification for
specific operators, workloads, hardware, etc... This allows them to make
trade-offs based on their specific application domain, whilst also
ensuring that new ideas and approaches can continue to be incorporated
and adopted in the broader ecosystem. However, to then seek to
standardise these layouts seems to be both potentially unbounded scope
creep, and also somewhat counter productive if the goal of
standardisation is improved interoperability?
FWIW, I believe these formats are very friendly for row representation as
well, especially when stored as a payload (e.g. in a join).

For your more general point though I will ask the same question I asked on
the ArrayView discussion:

Is Arrow meant to only be used in between systems (in this case query
engines) or is it also meant to be used in between components of a query
engine?

For example, if someone (datafusion, velox, etc.) were to come up with a
framework for UDFs then would batches be passed in and out of those UDFs in
the Arrow format?  If every engine has its own bespoke formats internally
then it seems we are placing a limit on how far things can be decomposed.
From the C++ perspective, I would personally like to see Arrow be usable
within components.  There are now multiple implementations of a query
engine and I think we are seeing just the edges of this query engine
decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or consuming
a velox task as a record batch stream into a different system) and these
sorts of challenges are in the forefront.

On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

Thus the approach you
describe for validating an entire character buffer as UTF-8 then checking
offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
The difference here is that it is perhaps expected for Utf8View to have
gaps in the underlying data that are not referenced as part of any value,
as I had understood this to be one of its benefits over the current
encoding. I think it would therefore be problematic to require that these
gaps be valid UTF-8.

Furthermore unlike an explicit
selection vector a kernel may decide to copy and densify dynamically if
it
detects that output is getting sparse or fragmented
I don't see why you couldn't do something similar to materialize a
sparse selection vector; if anything, being able to centralise this logic
outside specific kernels would be advantageous.

Specifically sorting and equality comparison
benefit significantly from the prefix comparison fast path,
so I'd anticipate that multi column sorting and aggregations would as
well

The point I was trying to make, albeit very badly, was that these
operations are typically implemented using some sort of row format [1]
[2], and therefore their performance is not impacted by the array
representations. I think it is both inevitable, and in fact something to
be encouraged, that query engines will implement their own in-memory
layouts and data structures outside of the arrow specification for
specific operators, workloads, hardware, etc... This allows them to make
trade-offs based on their specific application domain, whilst also
ensuring that new ideas and approaches can continue to be incorporated
and adopted in the broader ecosystem. However, to then seek to
standardise these layouts seems to be both potentially unbounded scope
creep and somewhat counterproductive, if the goal of standardisation is
improved interoperability. I fully expect that in the next 5 years someone
will come up with an even better way to encode strings for some particular
workload or hardware; do we then incorporate that as well?
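
As a concrete illustration of the row-format point, a small sketch of how
the arrow-row crate [2] is used: the Arrow columns are converted to the row
encoding once, after which a multi-column sort reduces to comparisons of
row bytes. (Exact API details may differ between arrow-rs versions; this is
illustrative only.)

    use std::sync::Arc;
    use arrow::array::{ArrayRef, Int64Array, StringArray};
    use arrow::datatypes::DataType;
    use arrow_row::{RowConverter, SortField};

    fn main() {
        let a: ArrayRef = Arc::new(StringArray::from(vec!["foo", "bar", "foo"]));
        let b: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));

        // Convert the Arrow columns into the comparable row format once...
        let mut converter = RowConverter::new(vec![
            SortField::new(DataType::Utf8),
            SortField::new(DataType::Int64),
        ])
        .unwrap();
        let rows = converter.convert_columns(&[a, b]).unwrap();

        // ...then the multi-column comparison is just a byte comparison.
        let mut indices: Vec<usize> = (0..rows.num_rows()).collect();
        indices.sort_unstable_by(|&l, &r| rows.row(l).cmp(&rows.row(r)));
    }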

I guess it boils down to what matters to people more, interoperability
or best-in-class performance? Currently I think it is fair to say both
Arrow and Parquet favour interoperability over performance, aiming to
provide performance that is good enough, broadly on the same order of
magnitude as a custom solution. I personally think this is the right
engineering trade-off, but appreciate opinions may differ. Ultimately I
just really want Arrow to avoid the situation Parquet has found itself in,
where the specification has far outstripped the ability of the
implementations to keep pace, whilst simultaneously standardising
approaches for things like delta encoding that are now considered
extremely sub-optimal for modern hardware.

That all being said, I'm not against adding support for these arrays if
others are already on board; I just wonder whether the primary standard is
really the right place for them. Perhaps some extension mechanism might be
the way to go here, potentially with some negotiation mechanism; I'm not
really sure, and will continue to think on this.

Kind Regards,

Raphael

[1]: https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison
[2]: https://docs.rs/arrow-row/latest/arrow_row/

On 06/07/2023 17:47, Benjamin Kietzman wrote:
@Andrew:

Restricting these arrays to a single buffer will severely decrease their
utility. Since the character data is stored in multiple character buffers,
writing a Utf8View array can proceed without resizing allocations, which is
a major overhead when writing Utf8 arrays. Furthermore, since the character
buffers have no restrictions on their size, it's straightforward to reuse
an existing buffer as a character buffer rather than always allocating a
new one. In the case of creating an array which shares a lot of data with
another (for example, appending some strings) we can reuse most of the
character buffers from the original. Finally, Utf8View is well adapted for
efficiently wrapping non-arrow string data for ingestion by a kernel, even
if the string data's full extent is not known ahead of time and is spread
across multiple non-contiguous buffers.
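
To sketch the buffer reuse this enables (illustrative types only, not an
existing API): appending to a Utf8View array adds a single new character
buffer holding the new data, while every existing view keeps referencing
the original buffers without copying.

    use std::sync::Arc;

    // Stand-ins for the real structures: a 16-byte view (stored here as a
    // u128) and reference-counted character buffers.
    type View = u128;
    type CharBuffer = Arc<Vec<u8>>;

    struct Utf8ViewArray {
        views: Vec<View>,
        buffers: Vec<CharBuffer>, // variadic: an array may reference many buffers
    }

    /// Append new strings to an existing array. For simplicity this assumes the
    /// caller built `new_views` against buffer index `original.buffers.len()`.
    fn append(
        original: &Utf8ViewArray,
        new_views: Vec<View>,
        new_data: Vec<u8>,
    ) -> Utf8ViewArray {
        // Reuse every existing character buffer (cheap Arc clones, no copies)...
        let mut buffers = original.buffers.clone();
        // ...and add one new buffer holding only the appended string data.
        buffers.push(Arc::new(new_data));

        let mut views = original.views.clone();
        views.extend(new_views);
        Utf8ViewArray { views, buffers }
    }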

@Raphael:

branch on access
The branch-on-access is unavoidable since a primary feature of the Utf8View
format is keeping short strings inline in the fixed width portion of data.
It's worth noting that the inline prefix allows skipping the branch entirely
for common cases of comparison, for example when the strings to be compared
differ within the first 4 bytes.
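
A conceptual sketch of that fast path (field names and layout details are
illustrative, not the actual C++ or Rust definitions): the view always
carries the length and the first four bytes, so inequality can often be
decided without touching any character buffer.

    /// Illustrative 16-byte view: 4-byte length, 4-byte prefix, then either
    /// the remaining inline bytes (strings of 12 bytes or fewer) or a buffer
    /// index + offset for longer strings.
    struct Utf8View {
        len: u32,
        prefix: [u8; 4],
        payload: Payload,
    }

    enum Payload {
        Inline([u8; 8]),                        // rest of a short string's bytes
        Reference { buffer: u32, offset: u32 }, // location of a long string
    }

    /// Prefix fast path: if the lengths or first four bytes differ, the values
    /// cannot be equal, so neither the inline/out-of-line branch nor a buffer
    /// load is needed. Only when this returns true must a full comparison run.
    fn could_be_equal(a: &Utf8View, b: &Utf8View) -> bool {
        a.len == b.len && a.prefix == b.prefix
    }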

In benchmarking (for example while building a hash table) I have not
observed that this branch overly pessimizes access. Although I can't
guarantee every Utf8View array will be more efficient than any Utf8 array,
it is certainly faster for many relevant cases. Specifically sorting and
equality comparison benefit significantly from the prefix comparison fast
path, so I'd anticipate that multi column sorting and aggregations would as
well. If there are any other benchmarks which would help to justify Utf8View
in your mind, I'd be happy to try writing them.

UTF-8 validation for StringArray can be done very efficiently by first
verifying the entire buffer, and then verifying the offsets correspond to
the start of a UTF-8 codepoint

For non-inlined strings, the character buffers do always contain the entire
string's data and not just the last `len - 4` bytes. Thus the approach you
describe for validating an entire character buffer as UTF-8 then checking
offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
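
Concretely, the validation could look something like the following sketch
(the accessor shapes are illustrative, and inlined short views, which carry
their bytes directly, would be checked separately):

    /// Validate UTF-8 for the non-inlined views of a hypothetical Utf8View
    /// array: validate each character buffer once in bulk, then only check
    /// that each view begins and ends on a char boundary.
    fn validate_utf8(buffers: &[&[u8]], views: &[(u32, u32, u32)]) -> bool {
        // 1. Bulk-validate every character buffer. Note this implicitly
        //    requires any unreferenced gaps in the buffers to be valid UTF-8.
        let validated: Vec<&str> = match buffers
            .iter()
            .map(|b| std::str::from_utf8(b))
            .collect::<Result<_, _>>()
        {
            Ok(bufs) => bufs,
            Err(_) => return false,
        };

        // 2. Each view is (buffer index, offset, length); boundary checks are
        //    far cheaper than re-validating every value individually.
        views.iter().all(|&(buffer, offset, len)| {
            let s = validated[buffer as usize];
            s.is_char_boundary(offset as usize) && s.is_char_boundary((offset + len) as usize)
        })
    }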

it does seem inconsistent to use unsigned types
It is indeed more typical for the arrow format to use signed integers for
offsets and other quantities. In this case there is prior art in other
engines with which we can remain compatible by using unsigned integers
instead. Since this is only a break with convention within the format and
shouldn't be difficult for any implementation to accommodate, I would argue
that it's worthwhile to avoid pushing change onto existing implementers.

I presume that StringView will behave similarly to dictionaries in that
the selection kernels will not recompute the underlying value buffers.

The Utf8View format itself is not prescriptive of selection operations on
the array; kernels are free to reuse character buffers (which produces an
implicit selection vector) or to recompute them. Furthermore unlike an
explicit selection vector a kernel may decide to copy and densify
dynamically if it detects that output is getting sparse or fragmented. It's
also worth noting that unlike an explicit selection vector a Utf8View array
(however sparse or fragmented) will still benefit from the prefix comparison
fast path.

Sincerely,
Ben Kietzman

On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid>  wrote:

I would be interested in hearing some input from the Rust community.
   A couple of thoughts:

The variable number of buffers would definitely pose some challenges for
the Rust implementation; the closest thing we currently have is possibly
UnionArray, but even then the number of buffers is still determined
statically by the DataType. I therefore also wonder about the possibility
of always having a single backing buffer that stores the character data,
including potentially a copy of the prefix. This would also avoid forcing a
branch on access, which I would have expected to hurt performance for some
kernels quite significantly.

Whilst not really a concern for Rust, which supports unsigned types, it
does seem inconsistent to use unsigned types where the rest of the format
encourages the use of signed offsets, etc...

It isn't clearly specified whether a null should have a valid set of
offsets, etc... I think it is an important property of the current array
layouts that, with the exception of dictionaries, the data in null slots is
arbitrary, i.e. can take any value, but not undefined. This allows for
separate handling of the null mask and values, which can be important for
some kernels and APIs.
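
To illustrate why that property matters, a minimal sketch (not the actual
arrow-rs kernels, and assuming a recent arrow crate) of a binary kernel
that computes values for every slot without branching on validity and
handles the null masks entirely separately; this is only sound because data
in null slots is arbitrary but still initialised:

    use std::sync::Arc;
    use arrow::array::{Array, ArrayRef, Int64Array};
    use arrow::buffer::NullBuffer;

    fn add(lhs: &Int64Array, rhs: &Int64Array) -> ArrayRef {
        assert_eq!(lhs.len(), rhs.len());

        // Compute over every slot, valid or not; results in null slots are
        // never observed, and wrapping arithmetic keeps this well defined.
        let values: Vec<i64> = lhs
            .values()
            .iter()
            .zip(rhs.values().iter())
            .map(|(a, b)| a.wrapping_add(*b))
            .collect();

        // Null handling is completely independent of the value computation.
        let nulls = NullBuffer::union(lhs.nulls(), rhs.nulls());
        Arc::new(Int64Array::new(values.into(), nulls))
    }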

More an observation than an issue, but UTF-8 validation for StringArray
can be done very efficiently by first verifying the entire buffer, and
then
verifying the offsets correspond to the start of a UTF-8 codepoint. This
same approach would not be possible for StringView, which would need to
verify individual values and would therefore be significantly more
expensive. As it is UB for a Rust string to contain non-UTF-8 data, this
validation is perhaps more important for Rust than for other languages.

I presume that StringView will behave similarly to dictionaries in that
the selection kernels will not recompute the underlying value buffers. I
think this is fine, but it is perhaps worth noting this has caused
confusion in the past, as people somewhat reasonably expect an array
post-selection to have memory usage reflecting the smaller selection. This
is then especially noticeable if the data is written out to IPC, and still
contains data that was supposedly filtered out. My 2 cents is that explicit
selection vectors are a less surprising way to defer selection than baking
it into the array, but I also don't have any workloads where this is the
major bottleneck so can't speak authoritatively here.

Which leads on to my major concern with this proposal: it adds complexity
and cognitive load to the specification and implementations, whilst not
meaningfully improving the performance of the operators that I commonly
encounter as performance bottlenecks, which are multi-column sorts and
aggregations, or the expensive string operations such as matching or
parsing. If we didn't already have a string representation I would be more
on board, but as it stands I'm definitely on the fence, especially given
selection performance can be improved in less intrusive ways using
dictionaries or selection vectors.

Kind Regards,

Raphael Taylor-Davies

On 02/07/2023 11:46, Andrew Lamb wrote:

* This is the first layout where the number of buffers depends on the data
and not the schema. I think this is the most architecturally significant
fact.

I have spent some time reading the initial proposal -- thank you for that.
I now understand what Weston was saying about the "variable numbers of
buffers". I wonder if you considered restricting such arrays to a single
buffer (so as to make them more similar to other arrow array types that
have a fixed number of buffers)?

On Tue, Jun 20, 2023 at 11:33 AM Weston Pace <weston.p...@gmail.com> wrote:

Before I say anything else I'll say that I am in favor of this new layout.
There is some existing literature on the idea (e.g. umbra) and your
benchmarks show some nice improvements. Compared to some of the other
layouts we've discussed recently (REE, list view) I do think this layout is
more unique and fundamentally different. Perhaps most fundamentally
different:

* This is the first layout where the number of buffers depends on the data
and not the schema. I think this is the most architecturally significant
fact. It does require a (backwards compatible) change to the IPC format
itself, beyond just adding new type codes. It also poses challenges in
places where we've assumed there will be at most 3 buffers (e.g. in
ArraySpan, though, as you have shown, we can work around this using a raw
pointers representation internally in those spots).

I think you've done some great work to integrate this well with Arrow-C++
and I'm convinced it can work. I would be interested in hearing some input
from the Rust community.

Ben, at one point there was some discussion that this might be a c-data
only type. However, I believe that was based on the raw pointers
representation. What you've proposed here, if I understand correctly, is an
index + offsets representation and it is suitable for IPC correct? (e.g. I
see that you have changes and examples in the IPC reader/writer)

On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <bengil...@gmail.com> wrote:

Hi Gang,

I'm not sure what you mean, sorry if my answers are off base: Parquet's
ByteArray will be unaffected by the addition of the string view type; all
arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and with
this patch arrow::Type::STRING_VIEW) are converted to ByteArrays during
serialization to parquet [1]. If you mean that encoding of
arrow::Type::STRING_VIEW will not be as fast as encoding of equivalent
arrow::Type::STRING, that's something I haven't benchmarked so I can't
answer definitively. I would expect it to be faster than first converting
STRING_VIEW->STRING then encoding to parquet; direct encoding avoids
allocating and populating temporary buffers. Of course this only applies to
cases where you need to encode an array of STRING_VIEW to parquet -
encoding of STRING to parquet will be unaffected.

Sincerely,
Ben

[1]
https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179

On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> wrote:
Hi Ben,

The posted benchmark [1] looks pretty good to me. However, I want to raise
a possible issue from the perspective of parquet-cpp. Parquet-cpp uses a
customized parquet::ByteArray type [2] for string/binary, so I would expect
some regression of conversions between the parquet reader/writer and the
proposed string view array, especially when some strings use the short form
and others use the long form.

[1]
https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
[2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617

Best,
Gang

On Fri, Jun 16, 2023 at 3:58 AM Will Jones <will.jones...@gmail.com> wrote:

Cool. Thanks for doing that!

On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <bengil...@gmail.com> wrote:

I've added https://github.com/apache/arrow/issues/36112 to track
deduplication of buffers on write. I don't think it would require
modification of the IPC format.

Ben

On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <zotthewiz...@gmail.com> wrote:

Based on my understanding, in theory a buffer *could* be shared within a
batch since the flatbuffers message just uses an offset and length to
identify the buffers.

That said, I don't believe any current implementation actually does this or
takes advantage of this in any meaningful way.

--Matt

On Thu, Jun 15, 2023 at 1:00 PM Will Jones <will.jones...@gmail.com> wrote:

Hi Ben,

It's exciting to see this move along.

The buffers will be duplicated. If buffer duplication becomes a concern,
I'd prefer to handle that in the ipc writer. Then buffers which are
duplicated could be detected by checking pointer identity and written only
once.

Question: to be able to write a buffer only once and reference it in
multiple arrays, does that require a change to the IPC format? Or is
sharing buffers within the same batch already allowed in the IPC format?

Best,
Will Jones

On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <bengil...@gmail.com> wrote:

Hello again all,

The PR [1] to add string view to the format and the C++ implementation is
hovering around passing CI and has been undrafted. Furthermore, there is
now also a PR [2] to add string view to the Go implementation. Code review
is underway for each PR and I'd like to move toward a vote for acceptance -
are there any other preliminaries which I've neglected?

To reiterate the answers to some past questions:

- Benchmarks are added in the C++ PR [1] to demonstrate the performance of
  conversion between the various string formats. In addition, there are
  some benchmarks which demonstrate the performance gains available with
  the new format [3].
- Adding string view to the C ABI is a natural follow up, but should be
  handled independently. An issue has been added to track that
  enhancement [4].

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/35628
[2] https://github.com/apache/arrow/pull/35769
[3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
[4] https://github.com/apache/arrow/issues/36099

On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <bengil...@gmail.com> wrote:

@Jacob

You mention benchmarks multiple times, are these results published
somewhere?

I benchmarked the performance of raw pointer vs index offset views in my PR
to velox, I do intend to port them to my arrow PR but I haven't gotten
there yet. Furthermore, it seemed less urgent to me since coexistence of
the two types in the c++ implementation defers the question of how
aggressively one should be preferred over the other.

@Dewey

I don't see the C Data interface in the PR

I have not addressed the C ABI in this PR. As you mention, it may be useful
to transmit arrays with raw pointer views between implementations which
allow them. I can address this in a follow up PR.

@Will

If I understand correctly, multiple arrays can reference the same buffers
in memory, but once they are written to IPC their data