Re: [DISCUSS][Format] Draft implementation of string view array format

Jacob Wujciak Tue, 16 May 2023 16:43:07 -0700

Hello everyone,
I think keeping interoperability with the large ecosystem is a very
important goal for Arrow, so I am overall in favor of this proposal!


You mention benchmarks multiple times. Are these results published
somewhere?

Thanks

On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <[email protected]>
wrote:

> Hello all,
>
> As previously discussed on this list [1], an
> UmbraDB/DuckDB/Velox-compatible "string view" type could bring several
> performance benefits to access and authoring of string data in the Arrow
> format [2]. Additionally, better interoperability with engines already
> using this format could be established.
>
> PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation and
> to the IPC format. For the purposes of IPC, raw pointers are not used.
> Instead, each view contains a pair of 32-bit unsigned integers which
> encode, respectively, the index of a character buffer (string view arrays
> may consist of a variable number of such buffers) and the offset of the
> view's data within that buffer.
> Benefits of this substitution include:
> - This makes explicit the guarantee that the lifetime of all character data
>   is equal to that of the array which views it, which is critical for
>   confident consumption across an interface boundary.
> - As with other types in the Arrow format, such arrays are serializable and
>   venue agnostic; directly usable in shared memory without modification.
> - Indices and offsets are easily validated.
>
> Accessing the data requires some trivial pointer arithmetic, but in
> benchmarking this had negligible impact on sequential access and only
> minor impact on random access.
>
> In the C++ implementation, raw pointer string views are supported as an
> extended case of the Utf8View type:
> `utf8_view(/*has_raw_pointers=*/true)`. Branching on this access pattern
> bit at the data type level has negligible impact on performance since the
> branch resides outside any hot loops. Utility functions are provided for
> efficient (potentially in-place) conversion between raw pointer and
> index/offset views. For example, the C++ implementation could zero-copy a
> raw pointer array from Velox, filter it, then convert to index/offset for
> serialization. Other implementations may choose to accommodate or eschew
> raw pointer views as their communities direct.
>
> Where desired in a rigorously controlled context, this still enables
> construction and safe consumption of string view arrays which reference
> memory not directly bound to the lifetime of the array. I'm not sure when
> or if we would find it useful to have arrays like this; I do not introduce
> any in [3]. I mention this possibility to highlight that if benchmarking
> demonstrates that such an approach brings a significant performance
> benefit to some operation, the only barrier to its adoption would be code
> review. Likewise, if more intensive benchmarking determines that raw
> pointer views critically outperform index/offset views for real-world
> analytics tasks, prioritizing raw pointer string views for usage within
> the C++ implementation will be straightforward.
>
> See also the proposal to Velox that their string view vector be refactored
> in a similar vein [4].
>
> Sincerely,
> Ben Kietzman
>
> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
> [3] https://github.com/apache/arrow/pull/35628
> [4] https://github.com/facebookincubator/velox/discussions/4362
>
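
For anyone following along, here is a rough sketch of how I read the
index/offset scheme in the quoted message: each view names a character
buffer and an offset into it, validation is a pair of bounds checks, access
is base pointer plus offset, and a raw pointer view can be converted by
finding the buffer that contains it. All names here (IndexOffsetView,
CharBuffers, IsValid, Resolve, ToIndexOffset) are hypothetical and are not
taken from the PR. The struct is also simplified: it does not attempt to
reproduce the actual binary layout, and it ignores the short-string inlining
used by Umbra-style views.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified view: illustrative only, not the PR's layout.
struct IndexOffsetView {
  std::uint32_t length;        // number of bytes in the string
  std::uint32_t buffer_index;  // which character buffer holds the bytes
  std::uint32_t offset;        // byte offset of the data within that buffer
};

// Stand-in for the array's variable number of character buffers.
using CharBuffers = std::vector<std::string>;

// Validation: the buffer must exist and the byte range must fit inside it.
inline bool IsValid(const IndexOffsetView& v, const CharBuffers& buffers) {
  if (v.buffer_index >= buffers.size()) return false;
  const std::uint64_t end = std::uint64_t{v.offset} + v.length;
  return end <= buffers[v.buffer_index].size();
}

// Access: the "trivial pointer arithmetic" is base pointer + offset.
inline std::string_view Resolve(const IndexOffsetView& v,
                                const CharBuffers& buffers) {
  assert(IsValid(v, buffers));
  return std::string_view(buffers[v.buffer_index].data() + v.offset, v.length);
}

// Conversion from a raw pointer view: find the buffer containing the bytes.
// Returns std::nullopt if the bytes are not owned by any of the buffers.
inline std::optional<IndexOffsetView> ToIndexOffset(std::string_view raw,
                                                    const CharBuffers& buffers) {
  std::less_equal<const char*> le;  // well-defined order across allocations
  for (std::uint32_t i = 0; i < buffers.size(); ++i) {
    const char* begin = buffers[i].data();
    const char* end = begin + buffers[i].size();
    if (le(begin, raw.data()) && le(raw.data() + raw.size(), end)) {
      return IndexOffsetView{static_cast<std::uint32_t>(raw.size()), i,
                             static_cast<std::uint32_t>(raw.data() - begin)};
    }
  }
  return std::nullopt;
}
```

With buffers {"hello world", "arrow"}, the view {length 5, buffer 0,
offset 6} resolves to "world", while {length 5, buffer 1, offset 1} fails
validation because it runs past the end of the second buffer. The linear
scan in ToIndexOffset is only for illustration; a real conversion utility
would presumably do something smarter.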
